{
"cells": [
{
"cell_type": "markdown",
"id": "db5f4f9a-7776-42b3-8758-85624d4c15ea",
"metadata": {
"id": "db5f4f9a-7776-42b3-8758-85624d4c15ea"
},
"source": [
""
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "21e9eafb",
"metadata": {
"id": "21e9eafb"
},
"source": [
"[](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/08.1.Automatic_Question_Generation_Legal_Texts.ipynb)"
]
},
{
"cell_type": "markdown",
"id": "9859b3bc-cec4-4189-88ed-37add5484623",
"metadata": {
"id": "9859b3bc-cec4-4189-88ed-37add5484623"
},
"source": [
"# Answering Questions on Legal Texts\n",
"One of the latests biggest outcomes in NLP are **Language Models** and their ability to answer questions, expressed in natural language."
]
},
{
"cell_type": "markdown",
"id": "__BAKoJW8zVv",
"metadata": {
"id": "__BAKoJW8zVv"
},
"source": [
"> *In February 2017, the Company entered into an asset purchase agreement with NetSeer, Inc.\n",
"...\n",
" The Company hereby grants to Seller a perpetual, non-exclusive, royalty-free license.\n",
"...\n",
"On March 12, 2020, we closed a Loan and Security Agreement with Hitachi Capital American Corp (also known as \"Hitachi\")\n",
"...*\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "uyqNQOoX8wgw",
"metadata": {
"id": "uyqNQOoX8wgw"
},
"source": [
"```\n",
"- What is the type of agreement?\n",
"- What is the type of license?\n",
"- What are the companies in the agreement?\n",
"- What is also known as the different compaines?\n",
"- Who is the recipient of a license?\n",
"````"
]
},
{
"cell_type": "markdown",
"id": "wElGMuoR7E0O",
"metadata": {
"id": "wElGMuoR7E0O"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "uvJsv20T7EPW",
"metadata": {
"id": "uvJsv20T7EPW"
},
"source": [
"\n",
"\n",
"**Question Answeering (QA)** uses specific Language Models trained to carry out **Natural Language Inference (NLI)**\n",
"\n",
"**NLI** works as follows:\n",
"- Given a text as a Premise (P);\n",
"- Given a hypotheses (H) as a question to be solved;\n",
" - Then, we ask the Language Model is H is `entailed`, `contradicted` or `not related` in P. \n"
]
},
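{
"cell_type": "markdown",
"id": "nli-toy-sketch-01",
"metadata": {
"id": "nli-toy-sketch-01"
},
"source": [
"As a toy illustration (not the actual model), the NLI step can be pictured as scoring a hypothesis against a premise. The `overlap_score` function below is a hypothetical stand-in for the sentence-similarity operation a real Language Model performs:\n",
"\n",
"```python\n",
"def overlap_score(premise, hypothesis):\n",
"    # Crude stand-in for encoder similarity: the fraction of\n",
"    # hypothesis words that also appear in the premise.\n",
"    p = set(premise.lower().split())\n",
"    h = set(hypothesis.lower().split())\n",
"    return len(p & h) / len(h)\n",
"\n",
"premise = 'The Buyer shall use such materials and supplies'\n",
"print(overlap_score(premise, 'the buyer shall use supplies'))  # high -> closer to entailed\n",
"print(overlap_score(premise, 'the seller shall pay a fee'))    # low -> not related\n",
"```"
]
},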
{
"cell_type": "markdown",
"id": "uenfXatl-dAR",
"metadata": {
"id": "uenfXatl-dAR"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "fyNazopzASEM",
"metadata": {
"id": "fyNazopzASEM"
},
"source": [
"Although we are not getting into the maths of it, it's basically done by using a Language Model to encode P, H and then carry out sentence similarity operations."
]
},
{
"cell_type": "markdown",
"id": "zjRUEO9SAQc2",
"metadata": {
"id": "zjRUEO9SAQc2"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "lI-PfAmwMntN",
"metadata": {
"id": "lI-PfAmwMntN"
},
"source": [
"# Creating questions on the fly: NerQuestionGenerator\n",
"Legal documents are known to be very long. Although you can divide the documents into paragraphs or sections, and those into sentences, the resulted sentences are still long.\n",
"\n",
"Let's take a look at this example:\n",
"\n",
"> `Buyer shall use such materials and supplies only in accordance with the present agreement`\n",
"\n",
"Let's target the extraction of the OBJECT (`such materials...`)"
]
},
{
"cell_type": "markdown",
"id": "5GR8bE5XOc62",
"metadata": {
"id": "5GR8bE5XOc62"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "apdaQgXXb6Y6",
"metadata": {
"id": "apdaQgXXb6Y6"
},
"source": [
"To do that, we can divide these kind of sentences into 3 parts:\n",
"1. The Subject (`Buyer`)\n",
"2. The Action (`shall use`)\n",
"3. The Object (what the Buyer shall use? - `such materials and supplies only in accordance with the present agreement`)"
]
},
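{
"cell_type": "markdown",
"id": "svo-split-sketch-01",
"metadata": {
"id": "svo-split-sketch-01"
},
"source": [
"A minimal sketch of that three-way split, assuming we already know the Action chunk (in practice NER or dependency parsing finds it; `split_clause` is a hypothetical helper, not part of the library):\n",
"\n",
"```python\n",
"def split_clause(sentence, action):\n",
"    # Naive split: everything before the action chunk is the subject,\n",
"    # everything after it is the object.\n",
"    subject, _, obj = sentence.partition(action)\n",
"    return subject.strip(), action, obj.strip()\n",
"\n",
"s = 'Buyer shall use such materials and supplies only in accordance with the present agreement'\n",
"print(split_clause(s, 'shall use'))\n",
"```"
]
},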
{
"cell_type": "markdown",
"id": "YhzfLnzoPP2s",
"metadata": {
"id": "YhzfLnzoPP2s"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "wRX15G-uPBKT",
"metadata": {
"id": "wRX15G-uPBKT"
},
"source": [
"These are the steps we are going to follow:\n",
"\n",
"1. We use NER to detect easy entities as the `Subject` and the `Action`. \n",
" - Example: `Buyer - SUBJECT`, `shall use - ACTION`\n",
"\n",
"2. Automatically generate a question to ask for the `Object`, using `Subject` and `Action`;\n",
" - Example: `What shall the Buyer use?`\n",
"\n",
"3. Use the question and the sentence to retrieve `Object`\n",
" - Example: `What shall the Buyer use? such materials and supplies only in accordance with the present agreement`\n",
"\n",
"Last, but not least, it's very important to chose a domain-specific Question Answering model.\n"
]
},
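{
"cell_type": "markdown",
"id": "question-template-sketch-01",
"metadata": {
"id": "question-template-sketch-01"
},
"source": [
"Step 2 can be sketched as a simple template over the detected entities. `make_question` below is a hypothetical illustration of the idea, not the actual `NerQuestionGenerator` API:\n",
"\n",
"```python\n",
"def make_question(subject, action):\n",
"    # 'Buyer' + 'shall use' -> 'What shall the Buyer use?'\n",
"    modal, _, verb = action.partition(' ')\n",
"    return 'What ' + modal + ' the ' + subject + ' ' + verb + '?'\n",
"\n",
"print(make_question('Buyer', 'shall use'))  # What shall the Buyer use?\n",
"```"
]
},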
{
"cell_type": "markdown",
"id": "gk3kZHmNj51v",
"metadata": {
"collapsed": false,
"id": "gk3kZHmNj51v"
},
"source": [
"# Installation"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "_914itZsj51v",
"metadata": {
"id": "_914itZsj51v",
"pycharm": {
"is_executing": true
}
},
"outputs": [],
"source": [
"! pip install -q johnsnowlabs"
]
},
{
"cell_type": "markdown",
"id": "YPsbAnNoPt0Z",
"metadata": {
"id": "YPsbAnNoPt0Z"
},
"source": [
"## Automatic Installation\n",
"Using my.johnsnowlabs.com SSO"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fY0lcShkj51w",
"metadata": {
"id": "fY0lcShkj51w",
"pycharm": {
"is_executing": true
}
},
"outputs": [],
"source": [
"from johnsnowlabs import nlp, legal\n",
"\n",
"# nlp.install(force_browser=True)"
]
},
{
"cell_type": "markdown",
"id": "hsJvn_WWM2GL",
"metadata": {
"id": "hsJvn_WWM2GL"
},
"source": [
"## Manual downloading\n",
"If you are not registered in my.johnsnowlabs.com, you received a license via e-email or you are using Safari, you may need to do a manual update of the license.\n",
"\n",
"- Go to my.johnsnowlabs.com\n",
"- Download your license\n",
"- Upload it using the following command"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "i57QV3-_P2sQ",
"metadata": {
"id": "i57QV3-_P2sQ"
},
"outputs": [],
"source": [
"from google.colab import files\n",
"print('Please Upload your John Snow Labs License using the button below')\n",
"license_keys = files.upload()"
]
},
{
"cell_type": "markdown",
"id": "xGgNdFzZP_hQ",
"metadata": {
"id": "xGgNdFzZP_hQ"
},
"source": [
"- Install it"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "OfmmPqknP4rR",
"metadata": {
"id": "OfmmPqknP4rR"
},
"outputs": [],
"source": [
"nlp.install()"
]
},
{
"cell_type": "markdown",
"id": "DCl5ErZkNNLk",
"metadata": {
"id": "DCl5ErZkNNLk"
},
"source": [
"# Starting"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "wRXTnNl3j51w",
"metadata": {
"id": "wRXTnNl3j51w"
},
"outputs": [],
"source": [
"spark = nlp.start()"
]
},
{
"cell_type": "markdown",
"id": "M-XQkfU_D1ZO",
"metadata": {
"id": "M-XQkfU_D1ZO"
},
"source": [
"Let's read and normalize a little the spacing of the NLP models may think those are different sentences and get unexpected results."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b342ab82",
"metadata": {
"id": "b342ab82"
},
"outputs": [],
"source": [
"text = \"\"\"The Buyer shall use such materials and supplies only in accordance with the present agreement\"\"\""
]
},
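{
"cell_type": "markdown",
"id": "spacing-normalization-sketch-01",
"metadata": {
"id": "spacing-normalization-sketch-01"
},
"source": [
"For longer documents, a common way to do that normalization (assuming only spacing, not content, should change) is to collapse runs of whitespace with the standard library:\n",
"\n",
"```python\n",
"import re\n",
"\n",
"def normalize_spacing(s):\n",
"    # Collapse newlines, tabs and repeated spaces into single spaces\n",
"    return re.sub(r'\\s+', ' ', s).strip()\n",
"\n",
"print(normalize_spacing('The  Buyer shall\\nuse'))  # The Buyer shall use\n",
"```"
]
},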
{
"cell_type": "markdown",
"id": "7fySdnKxnqnI",
"metadata": {
"id": "7fySdnKxnqnI"
},
"source": [
"#1. Extracting SUBJECT and VERB"
]
},
{
"cell_type": "markdown",
"id": "BvhTmh-EnUSm",
"metadata": {
"id": "BvhTmh-EnUSm"
},
"source": [
"## OPTION a: Use Dependency Parsing to retrieve SUBJECT and ACTION\n",
"Let's go the *grammatical* way!\n",
"\n",
"Let's use `Part of Speech` and `Dependency Parsing` to check for the SUBJECT and ACTION.\n",
"\n",
"- `PoS` retrieves morphological information of the words, like `VERB`, `NOUN`, etc.\n",
"- `Dependency Parsing` categories chunks by their grammatical role: `SUBJECT` and connects chunks together using their dependencies.\n",
"\n",
"For more information about `PoS` and `DepParsing`, please check the Spark NLP (Open Source) notebooks."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1_RKR3Bncm5e",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "1_RKR3Bncm5e",
"outputId": "2e05996a-b380-4087-f77a-bb8aa5867098"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"pos_anc download started this may take some time.\n",
"Approximate size to download 3.9 MB\n",
"[OK!]\n",
"dependency_conllu download started this may take some time.\n",
"Approximate size to download 16.7 MB\n",
"[OK!]\n",
"dependency_typed_conllu download started this may take some time.\n",
"Approximate size to download 2.4 MB\n",
"[OK!]\n"
]
}
],
"source": [
"import pandas as pd\n",
"\n",
"documentAssembler = nlp.DocumentAssembler()\\\n",
" .setInputCol(\"text\")\\\n",
" .setOutputCol(\"document\")\n",
"\n",
"tokenizer = nlp.Tokenizer()\\\n",
" .setInputCols(\"document\")\\\n",
" .setOutputCol(\"token\")\n",
"\n",
"pos = nlp.PerceptronModel.pretrained(\"pos_anc\", 'en')\\\n",
" .setInputCols(\"document\", \"token\")\\\n",
" .setOutputCol(\"pos\")\n",
"\n",
"dep_parser = nlp.DependencyParserModel.pretrained('dependency_conllu')\\\n",
" .setInputCols([\"document\", \"pos\", \"token\"])\\\n",
" .setOutputCol(\"dependency\")\n",
"\n",
"typed_dep_parser = nlp.TypedDependencyParserModel.pretrained('dependency_typed_conllu')\\\n",
" .setInputCols([\"token\", \"pos\", \"dependency\"])\\\n",
" .setOutputCol(\"dependency_type\")\n",
"\n",
"nlpPipeline = nlp.Pipeline(\n",
" stages=[\n",
" documentAssembler, \n",
" tokenizer,\n",
" pos,\n",
" dep_parser,\n",
" typed_dep_parser\n",
" ])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "REMaIM0cdQL4",
"metadata": {
"id": "REMaIM0cdQL4"
},
"outputs": [],
"source": [
"text_df = spark.createDataFrame([[text]]).toDF(\"text\")\n",
"\n",
"fit_model = nlpPipeline.fit(text_df)\n",
"\n",
"result = fit_model.transform(text_df)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "FgZNZfLSdS8w",
"metadata": {
"id": "FgZNZfLSdS8w"
},
"outputs": [],
"source": [
"from pyspark.sql import functions as F\n",
"result_df = result.select(F.explode(F.arrays_zip(result.token.result, \n",
" result.token.begin, \n",
" result.token.end, \n",
" result.dependency.result, \n",
" result.dependency_type.result,\n",
" result.pos.result)).alias(\"cols\")) \\\n",
" .select(F.expr(\"cols['0']\").alias(\"chunk\"),\n",
" F.expr(\"cols['1']\").alias(\"begin\"),\n",
" F.expr(\"cols['2']\").alias(\"end\"),\n",
" F.expr(\"cols['3']\").alias(\"dependency\"),\n",
" F.expr(\"cols['4']\").alias(\"dependency_type\"),\n",
" F.expr(\"cols['5']\").alias(\"PoS\")).toPandas()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ofZv-x-DgKxA",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 520
},
"id": "ofZv-x-DgKxA",
"outputId": "2f4b5189-5a92-4db7-b838-83be2bcc67f3"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
"
\n",
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
chunk
\n",
"
begin
\n",
"
end
\n",
"
dependency
\n",
"
dependency_type
\n",
"
PoS
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
The
\n",
"
0
\n",
"
2
\n",
"
Buyer
\n",
"
nsubj
\n",
"
DT
\n",
"
\n",
"
\n",
"
1
\n",
"
Buyer
\n",
"
4
\n",
"
8
\n",
"
use
\n",
"
nsubj
\n",
"
NNP
\n",
"
\n",
"
\n",
"
2
\n",
"
shall
\n",
"
10
\n",
"
14
\n",
"
use
\n",
"
appos
\n",
"
MD
\n",
"
\n",
"
\n",
"
3
\n",
"
use
\n",
"
16
\n",
"
18
\n",
"
ROOT
\n",
"
root
\n",
"
VB
\n",
"
\n",
"
\n",
"
4
\n",
"
such
\n",
"
20
\n",
"
23
\n",
"
materials
\n",
"
amod
\n",
"
JJ
\n",
"
\n",
"
\n",
"
5
\n",
"
materials
\n",
"
25
\n",
"
33
\n",
"
use
\n",
"
nsubj
\n",
"
NNS
\n",
"
\n",
"
\n",
"
6
\n",
"
and
\n",
"
35
\n",
"
37
\n",
"
supplies
\n",
"
cc
\n",
"
CC
\n",
"
\n",
"
\n",
"
7
\n",
"
supplies
\n",
"
39
\n",
"
46
\n",
"
materials
\n",
"
nsubj
\n",
"
NNS
\n",
"
\n",
"
\n",
"
8
\n",
"
only
\n",
"
48
\n",
"
51
\n",
"
accordance
\n",
"
amod
\n",
"
RB
\n",
"
\n",
"
\n",
"
9
\n",
"
in
\n",
"
53
\n",
"
54
\n",
"
accordance
\n",
"
det
\n",
"
IN
\n",
"
\n",
"
\n",
"
10
\n",
"
accordance
\n",
"
56
\n",
"
65
\n",
"
materials
\n",
"
amod
\n",
"
NN
\n",
"
\n",
"
\n",
"
11
\n",
"
with
\n",
"
67
\n",
"
70
\n",
"
agreement
\n",
"
det
\n",
"
IN
\n",
"
\n",
"
\n",
"
12
\n",
"
the
\n",
"
72
\n",
"
74
\n",
"
agreement
\n",
"
nsubj
\n",
"
DT
\n",
"
\n",
"
\n",
"
13
\n",
"
present
\n",
"
76
\n",
"
82
\n",
"
agreement
\n",
"
amod
\n",
"
JJ
\n",
"
\n",
"
\n",
"
14
\n",
"
agreement
\n",
"
84
\n",
"
92
\n",
"
accordance
\n",
"
flat
\n",
"
NN
\n",
"
\n",
" \n",
"
\n",
"
\n",
" \n",
" \n",
" \n",
"\n",
" \n",
"
\n",
"
\n",
" "
],
"text/plain": [
" chunk begin end dependency dependency_type PoS\n",
"0 The 0 2 Buyer nsubj DT\n",
"1 Buyer 4 8 use nsubj NNP\n",
"2 shall 10 14 use appos MD\n",
"3 use 16 18 ROOT root VB\n",
"4 such 20 23 materials amod JJ\n",
"5 materials 25 33 use nsubj NNS\n",
"6 and 35 37 supplies cc CC\n",
"7 supplies 39 46 materials nsubj NNS\n",
"8 only 48 51 accordance amod RB\n",
"9 in 53 54 accordance det IN\n",
"10 accordance 56 65 materials amod NN\n",
"11 with 67 70 agreement det IN\n",
"12 the 72 74 agreement nsubj DT\n",
"13 present 76 82 agreement amod JJ\n",
"14 agreement 84 92 accordance flat NN"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result_df"
]
},
{
"cell_type": "markdown",
"id": "quItUUehPdGP",
"metadata": {
"id": "quItUUehPdGP"
},
"source": [
"### Let's visualize with Spark NLP Display\n",
"To do that, we need the results of a LightPipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4w22wp1JPipo",
"metadata": {
"id": "4w22wp1JPipo"
},
"outputs": [],
"source": [
"lp = nlp.LightPipeline(fit_model)\n",
"pipeline_result = lp.fullAnnotate(text)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "yQEth2wgO7db",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 422
},
"id": "yQEth2wgO7db",
"outputId": "eb088b2d-1439-432c-fafd-010b62acf983"
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sparknlp_display import DependencyParserVisualizer\n",
"\n",
"dependency_vis = DependencyParserVisualizer()\n",
"\n",
"dependency_vis.display(pipeline_result[0], #should be the results of a single example, not the complete dataframe.\n",
" pos_col = 'pos', #specify the pos column\n",
" dependency_col = 'dependency', #specify the dependency column,\n",
" dependency_type_col = 'dependency_type'\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "SaeGqSvnc_Ol",
"metadata": {
"id": "SaeGqSvnc_Ol"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "03c6cb7d-b34f-4974-b63b-c04640f6a668",
"metadata": {
"id": "03c6cb7d-b34f-4974-b63b-c04640f6a668"
},
"source": [
"## Finding the `ACTION`\n",
"Let's get the root of the dependency trees (or the verb from pos)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "qHi5TD12kTuN",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 81
},
"id": "qHi5TD12kTuN",
"outputId": "58ac2f8b-4ee5-4909-cd0b-c6098bcd1ef5"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"