![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Deidentification Utils
This notebooks aims to showcase how to use the Deidentification module in `johnsnowlabs` library as a helper to carry out all deidentification tasks without any low code.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/11.1.Deidentification_Utility_Module.ipynb)

# Installation

In [None]:
! pip install -q johnsnowlabs

## Automatic Installation
Using my.johnsnowlabs.com SSO

In [2]:
from johnsnowlabs import nlp, legal

# nlp.install(force_browser=True)

## Manual downloading
If you are not registered in my.johnsnowlabs.com, you received a license via e-email or you are using Safari, you may need to do a manual update of the license.

- Go to my.johnsnowlabs.com
- Download your license
- Upload it using the following command

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

- Install it

In [None]:
nlp.install()

# Starting

In [None]:
spark = nlp.start()

In [6]:
import pandas as pd
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only

# Module

**Description of Parameters:** <br/>

---

`custom_pipeline` : Sparknlp PipelineModel, optional
            custom PipelineModel to be used for deidentification, by default None <br/>
        `ner_chunk` : str, optional
            final chunk column name of custom pipeline that will be deidentified, by default "ner_chunk" <br/>
        `fields` : dict, optional
            fields to be deidentified and their deidentification modes, by default {"text": "mask"} <br/>
        `sentence` : str, optional
            sentence column name of the given custom pipeline, by default "sentence" <br/>
        `token` : str, optional
            token column name of the given custom pipeline, by default "token" <br/>
        `document` : str, optional
            document column name of the given custom pipeline, by default "document" <br/>
        `masking_policy` : str, optional
            masking policy, by default "entity_labels" <br/>
        `fixed_mask_length` : int, optional
            fixed mask length, by default 4 <br/>
        `obfuscate_date` : bool, optional
            obfuscate date, by default True <br/>
        `obfuscate_ref_source` : str, optional
            obfuscate reference source, by default "faker" <br/>
        `obfuscate_ref_file_path` : str, optional
            obfuscate reference file path, by default None <br/>
        `age_group_obfuscation` : bool, optional
            age group obfuscation, by default False <br/>
        `age_ranges` : list, optional
            age ranges for obfuscation, by default [1, 4, 12, 20, 40, 60, 80] <br/>
        `shift_days` : bool, optional
            shift days, by default False <br/>
        `number_of_days` : int, optional
            number of days, by default None <br/>
        `documentHashCoder_col_name` : str, optional
            document hash coder column name, by default "documentHash" <br/>
        `date_tag` : str, optional
            date tag, by default "DATE" <br/>
        `language` : str, optional
            language, by default "en" <br/>
        `region` : str, optional
            region, by default "us" <br/>
        `unnormalized_date` : bool, optional
            unnormalized date, by default False <br/>
        `unnormalized_mode` : str, optional
            unnormalized mode, by default "mask" <br/>
        `id_column_name` : str, optional
            ID column name, by default "id" <br/>
        `date_shift_column_name` : str, optional
            date shift column name, by default "date_shift" <br/>
        `separator` : str, optional
            separator of input csv file, by default "\t" <br/> 
        `input_file_path` : str, optional
            input file path, by default None <br/>
        `output_file_path` : str, optional
            output file path, by default 'deidentified.csv'

**Returns**

---

Spark DataFrame: Spark DataFrame with deidentified text <br/>
csv/json file: A deidentified file.

In [7]:
text= """EMPLOYMENT AGREEMENT,   effective  as  of  June  1,  2013  between  Synergy  Resources Corporation,  a Colorado corporation (the "Company"),  and John E. Smith (the "Employee").
This First Amendment (Amendment) to the Employment Agreement between Service 1st Bank located in Stockton, California (Bank) and John E. Smith (Executive) is adopted, effective as of August 21, 2008, as set forth below."""

df = spark.createDataFrame([[text]]).toDF("text")

In [8]:
df_pd = df.toPandas()
df_pd.to_csv("deid_data.csv", sep='@', index=False)

# With a custom pipeline

In [9]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

legal_ner = legal.NerModel.pretrained("legner_contract_doc_parties_lg", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") 
    #.setLabelCasing("upper")

ner_converter = legal.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      legal_ner,
      ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]
legner_contract_doc_parties_lg download started this may take some time.
[OK!]


In [10]:
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

In [11]:
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|EMPLOYMENT AGREEM...|[{document, 0, 39...|[{document, 0, 17...|[{token, 0, 9, EM...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 0, 19, E...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [12]:
result.select(F.explode('sentence')).show(truncate=50)

+--------------------------------------------------+
|                                               col|
+--------------------------------------------------+
|{document, 0, 176, EMPLOYMENT AGREEMENT,   effe...|
|{document, 178, 396, This First Amendment (Amen...|
+--------------------------------------------------+



In [13]:
from pyspark.sql import functions as F

result_df = result.select(F.explode(F.arrays_zip(result.token.result, 
                                                 result.ner.result)).alias("cols")) \
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("ner_label"))

In [14]:
result_df.show()

+-----------+---------+
|      token|ner_label|
+-----------+---------+
| EMPLOYMENT|    B-DOC|
|  AGREEMENT|    I-DOC|
|          ,|        O|
|  effective|        O|
|         as|        O|
|         of|        O|
|       June|B-EFFDATE|
|          1|I-EFFDATE|
|          ,|I-EFFDATE|
|       2013|I-EFFDATE|
|    between|        O|
|    Synergy|  B-PARTY|
|  Resources|  I-PARTY|
|Corporation|  I-PARTY|
|          ,|        O|
|          a|        O|
|   Colorado|        O|
|corporation|        O|
|          (|        O|
|        the|        O|
+-----------+---------+
only showing top 20 rows



In [15]:
result_df.select("token", "ner_label").groupBy('ner_label').count().orderBy('count', ascending=False).show(truncate=False)

+---------+-----+
|ner_label|count|
+---------+-----+
|O        |49   |
|I-PARTY  |10   |
|I-EFFDATE|6    |
|B-ALIAS  |4    |
|B-PARTY  |4    |
|B-DOC    |2    |
|I-DOC    |2    |
|B-EFFDATE|2    |
+---------+-----+



## Default parameters

In [16]:
deid_implementor = legal.Deid(spark,
                             input_file_path="deid_data.csv",
                             output_file_path="deidentified.csv", 
                             custom_pipeline=model,
                             separator='@')

res = deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [17]:
res.show(n=50, truncate=False)

+---+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ID |text                                                                                                                                                                                                                       |text_deidentified                                                                                                                                                                       |
+---+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [18]:
#checking saved output file
import pandas as pd
res_data = pd.read_csv("deidentified.csv")
res_data.head()

Unnamed: 0,ID,text,text_deidentified
0,0,"EMPLOYMENT AGREEMENT, effective as of Jun...","<DOC>, effective as of <EFFDATE> between..."
1,0,This First Amendment (Amendment) to the Employ...,This First Amendment (Amendment) to the <DOC> ...


## Mask options 


### same_length_chars

In [19]:
deid_implementor = legal.Deid(spark,
                              input_file_path="deid_data.csv",
                              output_file_path="deidentified.csv",
                              custom_pipeline=model,
                              fields={"text": "mask"}, masking_policy="same_length_chars")

res = deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [20]:
res.show(truncate=120)

+---+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
| ID|                                                                                                                    text|                                                                                                       text_deidentified|
+---+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
|  0|EMPLOYMENT AGREEMENT,   effective  as  of  June  1,  2013  between  Synergy  Resources Corporation,  a Colorado corpo...|[******************],   effective  as  of  [************]  between  [****************************],  a Colorado corpo...|
|  0|Thi

### fixed_length_chars

In [21]:
deid_implementor = legal.Deid(spark,
                              input_file_path="deid_data.csv",
                              output_file_path="deidentified.csv",
                              custom_pipeline=model,
                              fields={"text": "mask"}, masking_policy="fixed_length_chars", fixed_mask_length=2)

res = deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [22]:
res.show(truncate=120)

+---+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
| ID|                                                                                                                    text|                                                                                                       text_deidentified|
+---+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
|  0|EMPLOYMENT AGREEMENT,   effective  as  of  June  1,  2013  between  Synergy  Resources Corporation,  a Colorado corpo...|                       **,   effective  as  of  **  between  **,  a Colorado corporation (the "**"),  and ** (the "**").|
|  0|Thi

## Obfuscate Options

### obfuscate_ref_source="file"

In [23]:
obs_lines = """John Snow Labs#PARTY
Amazon INC#PARTY
1st June, 2023#EFFDATE
23 of July, 2023#EFFDATE
Party 1#ALIAS
Party 2#ALIAS
PRIVATE AGREEMENT#DOC
CONTRACT#DOC
"""

with open ('obfuscation.txt', 'w') as f:
  f.write(obs_lines)

In [24]:
df= spark.createDataFrame([[text]]).toDF("text")
df_pd= df.toPandas()
df_pd.to_csv("deid_obfs_data.csv", index=False)

In [25]:
deid_implementor = legal.Deid(spark,
                              input_file_path="deid_obfs_data.csv",
                              output_file_path="deidentified.csv",
                              custom_pipeline=model,
                              fields={"text": "obfuscate"}, obfuscate_ref_source="file",
                              obfuscate_ref_file_path="obfuscation.txt")

res = deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [26]:
res.show(truncate=120)

+---+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
| ID|                                                                                                                    text|                                                                                                       text_deidentified|
+---+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
|  0|EMPLOYMENT AGREEMENT,   effective  as  of  June  1,  2013  between  Synergy  Resources Corporation,  a Colorado corpo...|CONTRACT,   effective  as  of  1st June, 2023  between  John Snow Labs,  a Colorado corporation (the "Party 2"),  and...|
|  0|Thi

### obfuscate_ref_source=faker 
You can also use our internal faker library has its own vocabulary (with a predefined vocabulary for ORG, DOCUMENT TYPES, etc).

However, some entities may not be supported by faker, as the number of models increase in the Financial NLP library. If so, you will just see <ENTITY>.

In that case, please come back to a mixed or file-only approaches.

### obfuscate_ref_source=both
This option uses both internal faker library and the file. 

In [27]:
deid_implementor = legal.Deid(spark,
                              input_file_path="deid_obfs_data.csv",
                              output_file_path="deidentified.csv",
                              custom_pipeline=model,
                              fields={"text": "obfuscate"}, obfuscate_ref_source="both",
                              obfuscate_ref_file_path="obfuscation.txt")

res = deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [28]:
res.show(truncate=120)

+---+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
| ID|                                                                                                                    text|                                                                                                       text_deidentified|
+---+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
|  0|EMPLOYMENT AGREEMENT,   effective  as  of  June  1,  2013  between  Synergy  Resources Corporation,  a Colorado corpo...|CONTRACT,   effective  as  of  1st June, 2023  between  Amazon INC,  a Colorado corporation (the "Party 2"),  and Joh...|
|  0|Thi

## Using Date Matcher to normalize the dates from text to date format

In [29]:
import pandas as pd
data = pd.DataFrame(
    {'clientID' : ['A001', 'A001', 'A002'],
     'text' : ['EMPLOYMENT AGREEMENT, effective as of June 1, 2013', 
               'This First Amendment adopted, effective as of August 21, 2008', 
               'Amendment to the Employment Agreement between Service 1st Bank located in Stockton, California (Bank) and John E. Smith (Executive) by 01/06/2023'
              ]
    }
)

my_input_df = spark.createDataFrame(data)

my_input_df.show(truncate = False)

+--------+-------------------------------------------------------------------------------------------------------------------------------------------------+
|clientID|text                                                                                                                                             |
+--------+-------------------------------------------------------------------------------------------------------------------------------------------------+
|A001    |EMPLOYMENT AGREEMENT, effective as of June 1, 2013                                                                                               |
|A001    |This First Amendment adopted, effective as of August 21, 2008                                                                                    |
|A002    |Amendment to the Employment Agreement between Service 1st Bank located in Stockton, California (Bank) and John E. Smith (Executive) by 01/06/2023|
+--------+------------------------------------------------

In [30]:
data

Unnamed: 0,clientID,text
0,A001,"EMPLOYMENT AGREEMENT, effective as of June 1, ..."
1,A001,"This First Amendment adopted, effective as of ..."
2,A002,Amendment to the Employment Agreement between ...


In [31]:
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

date = nlp.DateMatcher() \
    .setInputCols("document") \
    .setOutputCol("date") \
    .setOutputFormat("dd/MM/yyyy")
    #.setAnchorDateYear(2020) \  Use these if you want to stick with a specific month and range of days
    #.setAnchorDateMonth(1) \
    #.setAnchorDateDay(11) \

pipeline = nlp.Pipeline().setStages([
    documentAssembler,
    date
])

result = pipeline.fit(my_input_df).transform(my_input_df)

In [32]:
result.select("clientID", "text", "date.begin", "date.end", "date.result").show(truncate=False)

+--------+-------------------------------------------------------------------------------------------------------------------------------------------------+-----+-----+------------+
|clientID|text                                                                                                                                             |begin|end  |result      |
+--------+-------------------------------------------------------------------------------------------------------------------------------------------------+-----+-----+------------+
|A001    |EMPLOYMENT AGREEMENT, effective as of June 1, 2013                                                                                               |[38] |[49] |[01/06/2013]|
|A001    |This First Amendment adopted, effective as of August 21, 2008                                                                                    |[46] |[60] |[21/08/2008]|
|A002    |Amendment to the Employment Agreement between Service 1st Bank located in Stockt

You can then use `result` and `text[begin:end]` to modify your original strings

### shifting days according to the ID column
We will normalize the dates first using a DateMatcher

In [33]:
import pandas as pd
data = pd.DataFrame(
    {'clientID' : ['A001', 'A001', 'A002'],
     'text' : ['EMPLOYMENT AGREEMENT, effective as of 01/06/2013', 
               'This First Amendment adopted, effective as of 21/08/2008', 
               'Amendment to the Employment Agreement between Service 1st Bank located in Stockton, California (Bank) and John E. Smith (Executive) by 06/01/2023'
              ]
    }
)

my_input_df = spark.createDataFrame(data)

my_input_df.show(truncate = False)

+--------+-------------------------------------------------------------------------------------------------------------------------------------------------+
|clientID|text                                                                                                                                             |
+--------+-------------------------------------------------------------------------------------------------------------------------------------------------+
|A001    |EMPLOYMENT AGREEMENT, effective as of 01/06/2013                                                                                                 |
|A001    |This First Amendment adopted, effective as of 21/08/2008                                                                                         |
|A002    |Amendment to the Employment Agreement between Service 1st Bank located in Stockton, California (Bank) and John E. Smith (Executive) by 06/01/2023|
+--------+------------------------------------------------

In [34]:
df_pd = my_input_df.toPandas()
df_pd.to_csv("deid_id_data.csv", index=False)

Custom pipeline with `DocumentHashCoder()`. 

In [35]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

documentHasher = legal.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setRangeDays(100)\
    .setNewDateShift("shift_days")\
    .setPatientIdColumn("clientID")\
    .setSeed(100)

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document2"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["document2", "token"]) \
    .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained('legner_deid', "en", "legal/models")\
    .setInputCols(["document2", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["document2","token","ner"])\
    .setOutputCol("ner_chunk")
    
nlpPipeline = nlp.Pipeline().setStages([
    documentAssembler,
    documentHasher,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter])

empty_data = spark.createDataFrame([["", ""]]).toDF("text", "clientID")

pipeline_model = nlpPipeline.fit(empty_data)

roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]
legner_deid download started this may take some time.
[OK!]


In [36]:
deid_implementor = legal.Deid(spark,
                              input_file_path="deid_id_data.csv",
                              output_file_path="deidentified.csv",
                              custom_pipeline=pipeline_model,
                              fields={"text": "obfuscate"},
                              shift_days=True,
                              obfuscate_date=True, 
                              ner_chunk="ner_chunk",
                              token="token",
                              documenthashcoder_col_name="document2",
                              separator=",",
                              unnormalized_date=False)

res = deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [37]:
res.show(truncate=False)

+---+-------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------+
|ID |text                                                                                                                                             |text_deidentified                                                                                                            |
+---+-------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------+
|0  |EMPLOYMENT AGREEMENT, effective as of 01/06/2013                                                                                                 |EMPLOYMENT AGRE

### shifting days according to specified values: XX/XX/XXXX or textual formats: June 10th, 2023

In [38]:
data = pd.DataFrame(
    {'clientID' : ['A001', 'A001', 'A002'],
     'text' : ['EMPLOYMENT AGREEMENT, effective as of 10 June 2013', 
               'This First Amendment adopted, effective as of August 8th, 2008', 
               'Amendment to the Employment Agreement between Service 1st Bank located in Stockton, California (Bank) and John E. Smith (Executive) by 06/01/2023'
              ],
     'dateshift' : ['10', '-2', '30']
    }
)

my_input_df = spark.createDataFrame(data)

my_input_df.show(truncate=False)

+--------+-------------------------------------------------------------------------------------------------------------------------------------------------+---------+
|clientID|text                                                                                                                                             |dateshift|
+--------+-------------------------------------------------------------------------------------------------------------------------------------------------+---------+
|A001    |EMPLOYMENT AGREEMENT, effective as of 10 June 2013                                                                                               |10       |
|A001    |This First Amendment adopted, effective as of August 8th, 2008                                                                                   |-2       |
|A002    |Amendment to the Employment Agreement between Service 1st Bank located in Stockton, California (Bank) and John E. Smith (Executive) by 06/01/2023|30       

In [39]:
df_pd= my_input_df.toPandas()
df_pd.to_csv("deid_specific_data.csv", index=False)

In [40]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

documentHasher = legal.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setDateShiftColumn("dateshift")\

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document2"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["document2", "token"]) \
    .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained('legner_deid', "en", "legal/models")\
    .setInputCols(["document2", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["document2","token","ner"])\
    .setOutputCol("ner_chunk")
    
nlpPipeline = nlp.Pipeline().setStages([
    documentAssembler,
    documentHasher,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter])

empty_data = spark.createDataFrame([["", "", ""]]).toDF("clientID","text", "dateshift")

pipeline_col_model = nlpPipeline.fit(empty_data)

roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]
legner_deid download started this may take some time.
[OK!]


In [41]:
deid_implementor = legal.Deid(spark,
                              input_file_path="deid_specific_data.csv",
                              separator=",",
                              output_file_path="deid_specific_data.csv",
                              custom_pipeline=pipeline_col_model,
                              fields={"text": "obfuscate"},
                              shift_days=True,
                              obfuscate_date=True, 
                              ner_chunk="ner_chunk",
                              token="token",
                              documenthashcoder_col_name="document2")

res = deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deid_specific_data.csv' !


In [42]:
res.show(truncate=False)

+---+-------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+
|ID |text                                                                                                                                             |text_deidentified                                                                                                                   |
+---+-------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+
|0  |EMPLOYMENT AGREEMENT, effective as of 10 June 2013                                                                                          

### unnormalized date formats

In [43]:
import pandas as pd

data = pd.DataFrame(
    {'clientID' : ['A001', 'A001', 'A002'],
     'text' : ['EMPLOYMENT AGREEMENT, effective as of 3May2002', 
               'This First Amendment adopted, effective as of Agust 8th, 2008', 
               'Amendment to the Employment Agreement between Service 1st Bank located in Stockton, California (Bank) and John E. Smith (Executive) by 06/01/2023'
              ],
     'dateshift' : ['10', '-2', '30']
    }
)


my_input_df = spark.createDataFrame(data)

my_input_df.show(truncate=False)

+--------+-------------------------------------------------------------------------------------------------------------------------------------------------+---------+
|clientID|text                                                                                                                                             |dateshift|
+--------+-------------------------------------------------------------------------------------------------------------------------------------------------+---------+
|A001    |EMPLOYMENT AGREEMENT, effective as of 3May2002                                                                                                   |10       |
|A001    |This First Amendment adopted, effective as of Agust 8th, 2008                                                                                    |-2       |
|A002    |Amendment to the Employment Agreement between Service 1st Bank located in Stockton, California (Bank) and John E. Smith (Executive) by 06/01/2023|30       

In [44]:
df_pd = my_input_df.toPandas()
df_pd.to_csv("deid_unnormalized_data.csv", index=False)

In [45]:
deid_implementor = legal.Deid(spark,
                              input_file_path="deid_unnormalized_data.csv",
                              output_file_path="deidentified.csv",
                              custom_pipeline=pipeline_col_model,
                              fields={"text": "obfuscate"},
                              shift_days=True,
                              obfuscate_date=True, 
                              ner_chunk="ner_chunk",
                              token="token",
                              documenthashcoder_col_name="document2",
                              separator=",",
                              unnormalized_date=True,
                              unnormalized_mode="mask"
                            )

res = deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [46]:
res.show(truncate=False)

+---+-------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
|ID |text                                                                                                                                             |text_deidentified                                                                                                          |
+---+-------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
|0  |EMPLOYMENT AGREEMENT, effective as of 3May2002                                                                                                   |EMPLOYMENT AGREEMENT,

**unnormalized_mode="obfuscate"**

In [47]:
deid_implementor = legal.Deid(spark,
                              input_file_path="deid_unnormalized_data.csv",
                              output_file_path="deidentified1.csv",
                              custom_pipeline=pipeline_col_model,
                              fields={"text": "obfuscate"},
                              shift_days=True,
                              obfuscate_date=True, 
                              ner_chunk="ner_chunk",
                              token="token",
                              documenthashcoder_col_name="document2",
                              separator=",",
                              unnormalized_date=True,
                              unnormalized_mode="obfuscate"
                            )

res = deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified1.csv' !


In [48]:
res.show(truncate=False)

+---+-------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+
|ID |text                                                                                                                                             |text_deidentified                                                                                                         |
+---+-------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+
|0  |EMPLOYMENT AGREEMENT, effective as of 3May2002                                                                                                   |EMPLOYMENT AGREEMENT, ef

# Default pipeline for Legal domain
 This pipeline does not include the masking of DOCUMENT TYPE.

In [49]:
deid_implementor = legal.Deid(spark,
                              ner_chunk = "merged_ner_chunks",
                              input_file_path="deid_data.csv",
                              output_file_path="deidentified_custompipe.csv",
                              domain="legal")

In [50]:
res.show(truncate=False)

+---+-------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+
|ID |text                                                                                                                                             |text_deidentified                                                                                                         |
+---+-------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+
|0  |EMPLOYMENT AGREEMENT, effective as of 3May2002                                                                                                   |EMPLOYMENT AGREEMENT, ef

# Structured Deidentification

In [76]:
# sample data
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/hipaa-table-001.txt

df = spark.read.format("csv") \
    .option("sep", "\t") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load("hipaa-table-001.txt")

df.show(truncate=False)

+---------------+----------+---+----------------------------------------------------+-------+--------------+---+---+
|NAME           |DOB       |AGE|ADDRESS                                             |ZIPCODE|TEL           |SBP|DBP|
+---------------+----------+---+----------------------------------------------------+-------+--------------+---+---+
|Cecilia Chapman|04/02/1935|83 |711-2880 Nulla St. Mankato Mississippi              |69200  |(257) 563-7401|101|42 |
|Iris Watson    |03/10/2009|9  |P.O. Box 283 8562 Fusce Rd. Frederick Nebraska      |20620  |(372) 587-2335|159|122|
|Bryar Pitts    |11/01/1921|98 |5543 Aliquet St. Fort Dodge GA                      |20783  |(717) 450-4729|149|52 |
|Theodore Lowe  |13/02/2002|16 |Ap #867-859 Sit Rd. Azusa New York                  |39531  |(793) 151-6230|134|115|
|Calista Wise   |20/08/1942|76 |7292 Dictum Av. San Antonio MI                      |47096  |(492) 709-6392|139|78 |
|Kyla Olsen     |12/05/1973|45 |Ap #651-8679 Sodales Av. Tamunin

## Default parameters

In [77]:
obfuscator = legal.StructuredDeidentification(spark,{"NAME":"NAME","AGE":"AGE"}, obfuscateRefSource = "faker")
obfuscator_df = obfuscator.obfuscateColumns(df)
obfuscator_df.show(truncate=False)

+--------------------+----------+-----+----------------------------------------------------+-------+--------------+---+---+
|NAME                |DOB       |AGE  |ADDRESS                                             |ZIPCODE|TEL           |SBP|DBP|
+--------------------+----------+-----+----------------------------------------------------+-------+--------------+---+---+
|[Natasha Bence]     |04/02/1935|[93] |711-2880 Nulla St. Mankato Mississippi              |69200  |(257) 563-7401|101|42 |
|[Karie Chimera]     |03/10/2009|[5]  |P.O. Box 283 8562 Fusce Rd. Frederick Nebraska      |20620  |(372) 587-2335|159|122|
|[Lucita Ferrara]    |11/01/1921|[93] |5543 Aliquet St. Fort Dodge GA                      |20783  |(717) 450-4729|149|52 |
|[Lorane Gell]       |13/02/2002|[12] |Ap #867-859 Sit Rd. Azusa New York                  |39531  |(793) 151-6230|134|115|
|[Lowell Guitar]     |20/08/1942|[73] |7292 Dictum Av. San Antonio MI                      |47096  |(492) 709-6392|139|78 |
|[Jimmey

## ref_source=File

In [78]:
obfuscator_unique_ref_test = '''Will Perry#NAME
John Smith#NAME
Marvin MARSHALL#NAME
Hubert GROGAN#NAME
ALTHEA COLBURN#NAME
Kalil AMIN#NAME
Inci FOUNTAIN#NAME
Jackson WILLE#NAME
Jack SANTOS#NAME
Mahmood ALBURN#NAME
Marnie MELINGTON#NAME
Aysha GHAZI#NAME
Maryland CODER#NAME
Darene GEORGIOUS#NAME
Shelly WELLBECK#NAME
Min Kun JAE#NAME
Thomson THOMAS#NAME
Christian SUDDINBURG#NAME
20#AGE
30#AGE
40#AGE
50#AGE
60#AGE
(901)111-2222#TEL
(109)333 1343#TEL
(570) 874-1112#TEL
(901)111-2222#TEL
(109)333 1343#TEL
(570) 874-1112#TEL
28450#ZIPCODE
49144#ZIPCODE
14412#ZIPCODE
10/10/1983#DOB
04/06/1990#DOB
03/11/2001#DOB
'''

with open('obfuscator_unique_ref_test.txt', 'w') as f:
  f.write(obfuscator_unique_ref_test)

In [79]:
obfuscator = legal.StructuredDeidentification(spark,{"NAME":"NAME","AGE":"AGE"}, 
                                        obfuscateRefFile = "/content/obfuscator_unique_ref_test.txt",
                                        obfuscateRefSource = "file",
                                        columnsSeed={"NAME": 23, "AGE": 23})
obfuscator_df = obfuscator.obfuscateColumns(df)

obfuscator_df.select("NAME","AGE").show(truncate=False)

+------------------+----+
|NAME              |AGE |
+------------------+----+
|[Inci FOUNTAIN]   |[60]|
|[Jack SANTOS]     |[30]|
|[Darene GEORGIOUS]|[30]|
|[Shelly WELLBECK] |[40]|
|[Hubert GROGAN]   |[40]|
|[Kalil AMIN]      |[40]|
|[ALTHEA COLBURN]  |[60]|
|[Thomson THOMAS]  |[60]|
|[Jack SANTOS]     |[60]|
|[Will Perry]      |[20]|
|[Jackson WILLE]   |[60]|
|[Shelly WELLBECK] |[40]|
|[Kalil AMIN]      |[30]|
|[Marnie MELINGTON]|[30]|
|[Min Kun JAE]     |[30]|
|[Marvin MARSHALL] |[60]|
|[Marvin MARSHALL] |[50]|
|[Min Kun JAE]     |[30]|
|[Maryland CODER]  |[20]|
|[Marnie MELINGTON]|[20]|
+------------------+----+
only showing top 20 rows



## shift days

In [80]:
# We can shift n days in the structured deidentification through "days" parameter when the column is a Date.

df = spark.createDataFrame([
            ["Juan García", "13/02/1977", "711 Nulla St.", "140", "673 431234"],
            ["Will Smith", "23/02/1977", "1 Green Avenue.", "140", "+23 (673) 431234"],
            ["Pedro Ximénez", "11/04/1900", "Calle del Libertador, 7", "100", "912 345623"]
        ]).toDF("NAME", "DOB", "ADDRESS", "SBP", "TEL")

df_pd= df.toPandas()
df_pd.to_csv("deid_dayshift_structured_data.csv", index=False)
df_pd.head()

Unnamed: 0,NAME,DOB,ADDRESS,SBP,TEL
0,Juan García,13/02/1977,711 Nulla St.,140,673 431234
1,Will Smith,23/02/1977,1 Green Avenue.,140,+23 (673) 431234
2,Pedro Ximénez,11/04/1900,"Calle del Libertador, 7",100,912 345623


In [82]:
obfuscator = legal.StructuredDeidentification(spark, 
                                             columns = {"NAME": "ID", "DOB": "DATE"},
                                             obfuscateRefSource = "faker",
                                             columnsSeed={"NAME": 23, "AGE": 23},
                                             days = 5)
obfuscator_df = obfuscator.obfuscateColumns(df)

In [83]:
obfuscator_df.show(truncate=False)

+----------+------------+-----------------------+---+----------------+
|NAME      |DOB         |ADDRESS                |SBP|TEL             |
+----------+------------+-----------------------+---+----------------+
|[G9296129]|[18/02/1977]|711 Nulla St.          |140|673 431234      |
|[M9239301]|[28/02/1977]|1 Green Avenue.        |140|+23 (673) 431234|
|[H3156881]|[16/04/1900]|Calle del Libertador, 7|100|912 345623      |
+----------+------------+-----------------------+---+----------------+

