{"cells":[{"cell_type":"markdown","source":["![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)"],"metadata":{"id":"uaBopjb7IIuk"},"id":"uaBopjb7IIuk"},{"cell_type":"markdown","metadata":{"id":"fb2ac5e3-fe7c-4431-bb24-3c48649be54b"},"source":["# Deidentification Utils\n","This notebook showcases how to use the Deidentification module in the `johnsnowlabs` library as a helper to carry out deidentification tasks with minimal code."],"id":"fb2ac5e3-fe7c-4431-bb24-3c48649be54b"},{"cell_type":"markdown","source":["[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/11.1.Deidentification_Utility_Module.ipynb)"],"metadata":{"id":"bbyFtQSnIKLp"},"id":"bbyFtQSnIKLp"},{"cell_type":"markdown","metadata":{"collapsed":false,"id":"gk3kZHmNj51v"},"source":["# Installation"],"id":"gk3kZHmNj51v"},{"cell_type":"code","execution_count":null,"metadata":{"id":"_914itZsj51v","pycharm":{"is_executing":true}},"outputs":[],"source":["! 
pip install -q johnsnowlabs"],"id":"_914itZsj51v"},{"cell_type":"markdown","metadata":{"id":"YPsbAnNoPt0Z"},"source":["## Automatic Installation\n","Using my.johnsnowlabs.com SSO"],"id":"YPsbAnNoPt0Z"},{"cell_type":"code","execution_count":2,"metadata":{"id":"fY0lcShkj51w","pycharm":{"is_executing":true},"executionInfo":{"status":"ok","timestamp":1685531516838,"user_tz":-180,"elapsed":475,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[],"source":["from johnsnowlabs import nlp, legal\n","\n","# nlp.install(force_browser=True)"],"id":"fY0lcShkj51w"},{"cell_type":"markdown","metadata":{"id":"hsJvn_WWM2GL"},"source":["## Manual downloading\n","If you are not registered at my.johnsnowlabs.com, received your license via e-mail, or are using Safari, you may need to install the license manually.\n","\n","- Go to my.johnsnowlabs.com\n","- Download your license\n","- Upload it using the following command"],"id":"hsJvn_WWM2GL"},{"cell_type":"code","execution_count":null,"metadata":{"id":"i57QV3-_P2sQ"},"outputs":[],"source":["from google.colab import files\n","print('Please Upload your John Snow Labs License using the button below')\n","license_keys = files.upload()"],"id":"i57QV3-_P2sQ"},{"cell_type":"markdown","metadata":{"id":"xGgNdFzZP_hQ"},"source":["- Install it"],"id":"xGgNdFzZP_hQ"},{"cell_type":"code","execution_count":null,"metadata":{"id":"OfmmPqknP4rR"},"outputs":[],"source":["nlp.install()"],"id":"OfmmPqknP4rR"},{"cell_type":"markdown","metadata":{"id":"DCl5ErZkNNLk"},"source":["# Starting"],"id":"DCl5ErZkNNLk"},{"cell_type":"code","execution_count":null,"metadata":{"id":"wRXTnNl3j51w"},"outputs":[],"source":["spark = nlp.start()"],"id":"wRXTnNl3j51w"},{"cell_type":"code","source":["import pandas as pd\n","from pyspark.sql import DataFrame\n","import pyspark.sql.functions as F\n","import pyspark.sql.types as T\n","import pyspark.sql as SQL\n","from pyspark import 
keyword_only"],"metadata":{"id":"guXFthmRBnLk","executionInfo":{"status":"ok","timestamp":1685531662699,"user_tz":-180,"elapsed":6,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"execution_count":6,"outputs":[],"id":"guXFthmRBnLk"},{"cell_type":"markdown","source":["# Module"],"metadata":{"id":"Gp7HROWvKJAW"},"id":"Gp7HROWvKJAW"},{"cell_type":"markdown","source":["**Description of Parameters:**
\n","\n","---\n","\n","`custom_pipeline` : Spark NLP PipelineModel, optional\n"," custom PipelineModel to be used for deidentification, by default None
\n"," `ner_chunk` : str, optional\n"," final chunk column name of custom pipeline that will be deidentified, by default \"ner_chunk\"
\n"," `fields` : dict, optional\n"," columns to be deidentified, each mapped to its deidentification mode (\"mask\" or \"obfuscate\"), by default {\"text\": \"mask\"}
\n"," `sentence` : str, optional\n"," sentence column name of the given custom pipeline, by default \"sentence\"
\n"," `token` : str, optional\n"," token column name of the given custom pipeline, by default \"token\"
\n"," `document` : str, optional\n"," document column name of the given custom pipeline, by default \"document\"
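\n"," `masking_policy` : str, optional\n"," masking policy applied in \"mask\" mode (this notebook uses \"entity_labels\", \"same_length_chars\" and \"fixed_length_chars\"), by default \"entity_labels\"
\n"," `fixed_mask_length` : int, optional\n"," mask length used with the \"fixed_length_chars\" masking policy, by default 4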
\n"," `obfuscate_date` : bool, optional\n"," obfuscate date, by default True
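\n"," `obfuscate_ref_source` : str, optional\n"," source of the replacement values in \"obfuscate\" mode: \"faker\", \"file\" or \"both\", by default \"faker\"
\n"," `obfuscate_ref_file_path` : str, optional\n"," path to the obfuscation reference file (one `value#LABEL` entry per line), used when obfuscate_ref_source is \"file\" or \"both\", by default None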
\n"," `age_group_obfuscation` : bool, optional\n"," age group obfuscation, by default False
\n"," `age_ranges` : list, optional\n"," age ranges for obfuscation, by default [1, 4, 12, 20, 40, 60, 80]
\n"," `shift_days` : bool, optional\n"," shift days, by default False
\n"," `number_of_days` : int, optional\n"," number of days, by default None
\n"," `documentHashCoder_col_name` : str, optional\n"," document hash coder column name, by default \"documentHash\"
\n"," `date_tag` : str, optional\n"," date tag, by default \"DATE\"
\n"," `language` : str, optional\n"," language, by default \"en\"
\n"," `region` : str, optional\n"," region, by default \"us\"
\n"," `unnormalized_date` : bool, optional\n"," unnormalized date, by default False
\n"," `unnormalized_mode` : str, optional\n"," unnormalized mode, by default \"mask\"
\n"," `id_column_name` : str, optional\n"," ID column name, by default \"id\"
\n"," `date_shift_column_name` : str, optional\n"," date shift column name, by default \"date_shift\"
\n"," `separator` : str, optional\n"," separator of input csv file, by default \"\\t\"
\n"," `input_file_path` : str, optional\n"," input file path, by default None
\n"," `output_file_path` : str, optional\n"," output file path, by default 'deidentified.csv'"],"metadata":{"id":"TXG4aUNnNaiD"},"id":"TXG4aUNnNaiD"},{"cell_type":"markdown","source":["**Returns**\n","\n","---\n","\n","Spark DataFrame: Spark DataFrame with deidentified text
\n","csv/json file: A deidentified file."],"metadata":{"id":"k1VfjOE3QILp"},"id":"k1VfjOE3QILp"},{"cell_type":"code","execution_count":7,"id":"aa341709-5099-41ae-b93a-7981d3d4be20","metadata":{"id":"aa341709-5099-41ae-b93a-7981d3d4be20","executionInfo":{"status":"ok","timestamp":1685531666881,"user_tz":-180,"elapsed":4186,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[],"source":["text= \"\"\"EMPLOYMENT AGREEMENT, effective as of June 1, 2013 between Synergy Resources Corporation, a Colorado corporation (the \"Company\"), and John E. Smith (the \"Employee\").\n","This First Amendment (Amendment) to the Employment Agreement between Service 1st Bank located in Stockton, California (Bank) and John E. Smith (Executive) is adopted, effective as of August 21, 2008, as set forth below.\"\"\"\n","\n","df = spark.createDataFrame([[text]]).toDF(\"text\")"]},{"cell_type":"code","source":["df_pd = df.toPandas()\n","df_pd.to_csv(\"deid_data.csv\", sep='@', index=False)"],"metadata":{"id":"3x88tenHPiLo","executionInfo":{"status":"ok","timestamp":1685531672246,"user_tz":-180,"elapsed":5368,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"id":"3x88tenHPiLo","execution_count":8,"outputs":[]},{"cell_type":"markdown","id":"c0aca490-e7b3-4c0e-b4ad-efb7d682a327","metadata":{"id":"c0aca490-e7b3-4c0e-b4ad-efb7d682a327"},"source":["# With a custom pipeline"]},{"cell_type":"code","execution_count":9,"id":"69201fae-274f-4d5e-99c8-24985e5feba0","metadata":{"scrolled":true,"tags":[],"id":"69201fae-274f-4d5e-99c8-24985e5feba0","colab":{"base_uri":"https://localhost:8080/"},"outputId":"8afa3252-2aa3-4bd2-d345-e57b4bef616d","executionInfo":{"status":"ok","timestamp":1685531766381,"user_tz":-180,"elapsed":94152,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["roberta_embeddings_legal_roberta_base download started this may take some 
time.\n","Approximate size to download 447.2 MB\n","[OK!]\n","legner_contract_doc_parties_lg download started this may take some time.\n","[OK!]\n"]}],"source":["documentAssembler = nlp.DocumentAssembler()\\\n"," .setInputCol(\"text\")\\\n"," .setOutputCol(\"document\")\n","\n","sentenceDetector = nlp.SentenceDetector()\\\n"," .setInputCols([\"document\"])\\\n"," .setOutputCol(\"sentence\")\n","\n","tokenizer = nlp.Tokenizer()\\\n"," .setInputCols([\"sentence\"])\\\n"," .setOutputCol(\"token\")\n","\n","embeddings = nlp.RoBertaEmbeddings.pretrained(\"roberta_embeddings_legal_roberta_base\",\"en\") \\\n"," .setInputCols([\"sentence\", \"token\"]) \\\n"," .setOutputCol(\"embeddings\")\n","\n","legal_ner = legal.NerModel.pretrained(\"legner_contract_doc_parties_lg\", \"en\", \"legal/models\")\\\n"," .setInputCols([\"sentence\", \"token\", \"embeddings\"]) \\\n"," .setOutputCol(\"ner\") \n"," #.setLabelCasing(\"upper\")\n","\n","ner_converter = legal.NerConverterInternal() \\\n"," .setInputCols([\"sentence\", \"token\", \"ner\"])\\\n"," .setOutputCol(\"ner_chunk\")\n","\n","nlpPipeline = nlp.Pipeline(stages=[\n"," documentAssembler, \n"," sentenceDetector,\n"," tokenizer,\n"," embeddings,\n"," legal_ner,\n"," ner_converter])\n","\n","empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n","\n","model = nlpPipeline.fit(empty_data)"]},{"cell_type":"code","source":["result = model.transform(spark.createDataFrame([[text]]).toDF(\"text\"))"],"metadata":{"id":"49NhBvSa9hYW","executionInfo":{"status":"ok","timestamp":1685531767163,"user_tz":-180,"elapsed":783,"user":{"displayName":"Bünyamin 
Polat","userId":"03982086590103784785"}}},"id":"49NhBvSa9hYW","execution_count":10,"outputs":[]},{"cell_type":"code","source":["result.show()"],"metadata":{"id":"yHG_FTIqPtju","colab":{"base_uri":"https://localhost:8080/"},"outputId":"576c7858-7f14-4dbf-d46c-a9450770ac40","executionInfo":{"status":"ok","timestamp":1685531779844,"user_tz":-180,"elapsed":12682,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"id":"yHG_FTIqPtju","execution_count":11,"outputs":[{"output_type":"stream","name":"stdout","text":["+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+\n","| text| document| sentence| token| embeddings| ner| ner_chunk|\n","+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+\n","|EMPLOYMENT AGREEM...|[{document, 0, 39...|[{document, 0, 17...|[{token, 0, 9, EM...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 0, 19, E...|\n","+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+\n","\n"]}]},{"cell_type":"code","source":["result.select(F.explode('sentence')).show(truncate=50)"],"metadata":{"id":"S4zcGzQlPx6V","colab":{"base_uri":"https://localhost:8080/"},"outputId":"5f9cc251-28a7-45fb-ffad-aa1f4d521dab","executionInfo":{"status":"ok","timestamp":1685531786906,"user_tz":-180,"elapsed":7064,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"id":"S4zcGzQlPx6V","execution_count":12,"outputs":[{"output_type":"stream","name":"stdout","text":["+--------------------------------------------------+\n","| col|\n","+--------------------------------------------------+\n","|{document, 0, 176, EMPLOYMENT AGREEMENT, effe...|\n","|{document, 178, 396, This First Amendment 
(Amen...|\n","+--------------------------------------------------+\n","\n"]}]},{"cell_type":"code","source":["from pyspark.sql import functions as F\n","\n","result_df = result.select(F.explode(F.arrays_zip(result.token.result, \n"," result.ner.result)).alias(\"cols\")) \\\n"," .select(F.expr(\"cols['0']\").alias(\"token\"),\n"," F.expr(\"cols['1']\").alias(\"ner_label\"))"],"metadata":{"id":"gcce6cXw9jj_","executionInfo":{"status":"ok","timestamp":1685531787336,"user_tz":-180,"elapsed":431,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"id":"gcce6cXw9jj_","execution_count":13,"outputs":[]},{"cell_type":"code","source":["result_df.show()"],"metadata":{"id":"CEMRr_OAPiqy","colab":{"base_uri":"https://localhost:8080/"},"outputId":"4ae75b96-3fc8-4d5f-a7b6-4003f48c372f","executionInfo":{"status":"ok","timestamp":1685531790527,"user_tz":-180,"elapsed":3192,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"id":"CEMRr_OAPiqy","execution_count":14,"outputs":[{"output_type":"stream","name":"stdout","text":["+-----------+---------+\n","| token|ner_label|\n","+-----------+---------+\n","| EMPLOYMENT| B-DOC|\n","| AGREEMENT| I-DOC|\n","| ,| O|\n","| effective| O|\n","| as| O|\n","| of| O|\n","| June|B-EFFDATE|\n","| 1|I-EFFDATE|\n","| ,|I-EFFDATE|\n","| 2013|I-EFFDATE|\n","| between| O|\n","| Synergy| B-PARTY|\n","| Resources| I-PARTY|\n","|Corporation| I-PARTY|\n","| ,| O|\n","| a| O|\n","| Colorado| O|\n","|corporation| O|\n","| (| O|\n","| the| O|\n","+-----------+---------+\n","only showing top 20 rows\n","\n"]}]},{"cell_type":"code","source":["result_df.select(\"token\", \"ner_label\").groupBy('ner_label').count().orderBy('count', ascending=False).show(truncate=False)"],"metadata":{"id":"6y-Eu10e9lO2","colab":{"base_uri":"https://localhost:8080/"},"outputId":"32ffdc4c-df06-4a36-ec57-1d2dea9f11a6","executionInfo":{"status":"ok","timestamp":1685531801754,"user_tz":-180,"elapsed":11231,"user":{"displayName":"Bünyamin 
Polat","userId":"03982086590103784785"}}},"id":"6y-Eu10e9lO2","execution_count":15,"outputs":[{"output_type":"stream","name":"stdout","text":["+---------+-----+\n","|ner_label|count|\n","+---------+-----+\n","|O |49 |\n","|I-PARTY |10 |\n","|I-EFFDATE|6 |\n","|B-ALIAS |4 |\n","|B-PARTY |4 |\n","|B-DOC |2 |\n","|I-DOC |2 |\n","|B-EFFDATE|2 |\n","+---------+-----+\n","\n"]}]},{"cell_type":"markdown","source":["## Default parameters"],"metadata":{"id":"Xxf8EYe-MmbA"},"id":"Xxf8EYe-MmbA"},{"cell_type":"code","source":["deid_implementor = legal.Deid(spark,\n"," input_file_path=\"deid_data.csv\",\n"," output_file_path=\"deidentified.csv\", \n"," custom_pipeline=model,\n"," separator='@')\n","\n","res = deid_implementor.deidentify()"],"metadata":{"id":"yK-ddxprM4oz","executionInfo":{"status":"ok","timestamp":1685531808985,"user_tz":-180,"elapsed":7232,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}},"outputId":"f7ba6208-9b1d-4037-d049-a8bd4d3d5af1","colab":{"base_uri":"https://localhost:8080/"}},"id":"yK-ddxprM4oz","execution_count":16,"outputs":[{"output_type":"stream","name":"stdout","text":["Deidentification process of the 'text' field has begun...\n","Deidentification process of the 'text' field was completed...\n","Deidentifcation successfully completed and the results saved as 'deidentified.csv' !\n"]}]},{"cell_type":"code","source":["res.show(n=50, truncate=False)"],"metadata":{"id":"v_tr4YZpMlql","colab":{"base_uri":"https://localhost:8080/"},"outputId":"07a54fdd-f7bb-4dba-9e1b-3068aa2821c9","executionInfo":{"status":"ok","timestamp":1685531809860,"user_tz":-180,"elapsed":876,"user":{"displayName":"Bünyamin 
Polat","userId":"03982086590103784785"}}},"id":"v_tr4YZpMlql","execution_count":17,"outputs":[{"output_type":"stream","name":"stdout","text":["+---+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n","|ID |text |text_deidentified |\n","+---+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n","|0 |EMPLOYMENT AGREEMENT, effective as of June 1, 2013 between Synergy Resources Corporation, a Colorado corporation (the \"Company\"), and John E. Smith (the \"Employee\"). |<DOC>, effective as of <EFFDATE> between <PARTY>, a Colorado corporation (the \"<ALIAS>\"), and <PARTY> (the \"<ALIAS>\"). |\n","|0 |This First Amendment (Amendment) to the Employment Agreement between Service 1st Bank located in Stockton, California (Bank) and John E. 
Smith (Executive) is adopted, effective as of August 21, 2008, as set forth below.|This First Amendment (Amendment) to the <DOC> between <PARTY> located in Stockton, California (<ALIAS>) and <PARTY> (<ALIAS>) is adopted, effective as of <EFFDATE>, as set forth below.|\n","+---+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n","\n"]}]},{"cell_type":"code","source":["#checking saved output file\n","import pandas as pd\n","res_data = pd.read_csv(\"deidentified.csv\")\n","res_data.head()"],"metadata":{"id":"FeoeISmMRFny","colab":{"base_uri":"https://localhost:8080/","height":112},"outputId":"2ba15973-497a-42d3-a7e7-b883f9dcbef4","executionInfo":{"status":"ok","timestamp":1685531809860,"user_tz":-180,"elapsed":11,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"id":"FeoeISmMRFny","execution_count":18,"outputs":[{"output_type":"execute_result","data":{"text/plain":[" ID text \\\n","0 0 EMPLOYMENT AGREEMENT, effective as of Jun... \n","1 0 This First Amendment (Amendment) to the Employ... \n","\n"," text_deidentified \n","0 <DOC>, effective as of <EFFDATE> between... \n","1 This First Amendment (Amendment) to the <DOC> ... "],"text/html":["\n","
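<div>\n","<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr style=\"text-align: right;\">\n","      <th></th>\n","      <th>ID</th>\n","      <th>text</th>\n","      <th>text_deidentified</th>\n","    </tr>\n","  </thead>\n","  <tbody>\n","    <tr>\n","      <th>0</th>\n","      <td>0</td>\n","      <td>EMPLOYMENT AGREEMENT, effective as of Jun...</td>\n","      <td>&lt;DOC&gt;, effective as of &lt;EFFDATE&gt; between...</td>\n","    </tr>\n","    <tr>\n","      <th>1</th>\n","      <td>0</td>\n","      <td>This First Amendment (Amendment) to the Employ...</td>\n","      <td>This First Amendment (Amendment) to the &lt;DOC&gt; ...</td>\n","    </tr>\n","  </tbody>\n","</table>\n","</div>\n","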
\n"," "]},"metadata":{},"execution_count":18}]},{"cell_type":"markdown","id":"6f496f91-9d58-4b2d-83dd-ed0de657460c","metadata":{"id":"6f496f91-9d58-4b2d-83dd-ed0de657460c"},"source":["## Mask options \n"]},{"cell_type":"markdown","source":["### same_length_chars"],"metadata":{"id":"PiV_Sdg0Qq1M"},"id":"PiV_Sdg0Qq1M"},{"cell_type":"code","source":["deid_implementor = legal.Deid(spark,\n"," input_file_path=\"deid_data.csv\",\n"," output_file_path=\"deidentified.csv\",\n"," custom_pipeline=model,\n"," fields={\"text\": \"mask\"}, masking_policy=\"same_length_chars\")\n","\n","res = deid_implementor.deidentify()"],"metadata":{"id":"ylycXCmRHIFE","executionInfo":{"status":"ok","timestamp":1685531815612,"user_tz":-180,"elapsed":5761,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}},"outputId":"13213538-8c5c-4937-c7e6-634e954cd636","colab":{"base_uri":"https://localhost:8080/"}},"id":"ylycXCmRHIFE","execution_count":19,"outputs":[{"output_type":"stream","name":"stdout","text":["Deidentification process of the 'text' field has begun...\n","Deidentification process of the 'text' field was completed...\n","Deidentifcation successfully completed and the results saved as 'deidentified.csv' !\n"]}]},{"cell_type":"code","execution_count":20,"id":"105724c1-4803-473a-9fcd-83ef3460fd96","metadata":{"id":"105724c1-4803-473a-9fcd-83ef3460fd96","colab":{"base_uri":"https://localhost:8080/"},"outputId":"380ca602-1717-4566-cace-7337b8c565b1","executionInfo":{"status":"ok","timestamp":1685531816919,"user_tz":-180,"elapsed":1308,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["+---+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+\n","| ID| text| 
text_deidentified|\n","+---+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+\n","| 0|EMPLOYMENT AGREEMENT, effective as of June 1, 2013 between Synergy Resources Corporation, a Colorado corpo...|[******************], effective as of [************] between [****************************], a Colorado corpo...|\n","| 0|This First Amendment (Amendment) to the Employment Agreement between Service 1st Bank located in Stockton, California...|This First Amendment (Amendment) to the [******************] between [**************] located in Stockton, California...|\n","+---+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+\n","\n"]}],"source":["res.show(truncate=120)"]},{"cell_type":"markdown","id":"b8324fef-246d-4731-be94-fbe818bcc3de","metadata":{"id":"b8324fef-246d-4731-be94-fbe818bcc3de"},"source":["### fixed_length_chars"]},{"cell_type":"code","execution_count":21,"id":"3e21fe0b-fa7d-42d6-94c0-fc437583cae5","metadata":{"id":"3e21fe0b-fa7d-42d6-94c0-fc437583cae5","colab":{"base_uri":"https://localhost:8080/"},"outputId":"e6460975-f742-4ef2-c7df-a275bdc8e739","executionInfo":{"status":"ok","timestamp":1685531821588,"user_tz":-180,"elapsed":4671,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["Deidentification process of the 'text' field has begun...\n","Deidentification process of the 'text' field was completed...\n","Deidentifcation successfully completed and the results saved as 'deidentified.csv' !\n"]}],"source":["deid_implementor = legal.Deid(spark,\n"," input_file_path=\"deid_data.csv\",\n"," 
output_file_path=\"deidentified.csv\",\n"," custom_pipeline=model,\n"," fields={\"text\": \"mask\"}, masking_policy=\"fixed_length_chars\", fixed_mask_length=2)\n","\n","res = deid_implementor.deidentify()"]},{"cell_type":"code","execution_count":22,"id":"772513fd-01a7-4c6e-afb1-1863ac863636","metadata":{"id":"772513fd-01a7-4c6e-afb1-1863ac863636","colab":{"base_uri":"https://localhost:8080/"},"outputId":"97da48e4-2ab3-4f24-fd87-e7d5a1d3f9c9","executionInfo":{"status":"ok","timestamp":1685531821994,"user_tz":-180,"elapsed":407,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["+---+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+\n","| ID| text| text_deidentified|\n","+---+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+\n","| 0|EMPLOYMENT AGREEMENT, effective as of June 1, 2013 between Synergy Resources Corporation, a Colorado corpo...| **, effective as of ** between **, a Colorado corporation (the \"**\"), and ** (the \"**\").|\n","| 0|This First Amendment (Amendment) to the Employment Agreement between Service 1st Bank located in Stockton, California...|This First Amendment (Amendment) to the ** between ** located in Stockton, California (**) and ** (**) is adopted, 
ef...|\n","+---+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+\n","\n"]}],"source":["res.show(truncate=120)"]},{"cell_type":"markdown","id":"f3d4e399-0ae4-441d-bc03-2040e65f8181","metadata":{"id":"f3d4e399-0ae4-441d-bc03-2040e65f8181"},"source":["## Obfuscate Options"]},{"cell_type":"markdown","source":["### obfuscate_ref_source=\"file\""],"metadata":{"id":"Qr54LYqETovR"},"id":"Qr54LYqETovR"},{"cell_type":"code","source":["obs_lines = \"\"\"John Snow Labs#PARTY\n","Amazon INC#PARTY\n","1st June, 2023#EFFDATE\n","23 of July, 2023#EFFDATE\n","Party 1#ALIAS\n","Party 2#ALIAS\n","PRIVATE AGREEMENT#DOC\n","CONTRACT#DOC\n","\"\"\"\n","\n","with open ('obfuscation.txt', 'w') as f:\n"," f.write(obs_lines)"],"metadata":{"id":"ysKMVf2sTt-L","executionInfo":{"status":"ok","timestamp":1685531821994,"user_tz":-180,"elapsed":1,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"id":"ysKMVf2sTt-L","execution_count":23,"outputs":[]},{"cell_type":"code","source":["df= spark.createDataFrame([[text]]).toDF(\"text\")\n","df_pd= df.toPandas()\n","df_pd.to_csv(\"deid_obfs_data.csv\", index=False)"],"metadata":{"id":"YF3kzLRDUQ3i","executionInfo":{"status":"ok","timestamp":1685531822324,"user_tz":-180,"elapsed":331,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"id":"YF3kzLRDUQ3i","execution_count":24,"outputs":[]},{"cell_type":"code","execution_count":25,"id":"4fa656da-8090-4546-acc4-13c328aa8769","metadata":{"id":"4fa656da-8090-4546-acc4-13c328aa8769","colab":{"base_uri":"https://localhost:8080/"},"outputId":"2ac5139b-943b-4bc5-b34c-ca195674dc53","executionInfo":{"status":"ok","timestamp":1685531826327,"user_tz":-180,"elapsed":4004,"user":{"displayName":"Bünyamin 
Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["Deidentification process of the 'text' field has begun...\n","Deidentification process of the 'text' field was completed...\n","Deidentifcation successfully completed and the results saved as 'deidentified.csv' !\n"]}],"source":["deid_implementor = legal.Deid(spark,\n"," input_file_path=\"deid_obfs_data.csv\",\n"," output_file_path=\"deidentified.csv\",\n"," custom_pipeline=model,\n"," fields={\"text\": \"obfuscate\"}, obfuscate_ref_source=\"file\",\n"," obfuscate_ref_file_path=\"obfuscation.txt\")\n","\n","res = deid_implementor.deidentify()"]},{"cell_type":"code","execution_count":26,"id":"f869c051-261a-4286-acc5-019842c5be22","metadata":{"id":"f869c051-261a-4286-acc5-019842c5be22","colab":{"base_uri":"https://localhost:8080/"},"outputId":"f36f8d25-64c0-4747-e031-dfae207f6a81","executionInfo":{"status":"ok","timestamp":1685531826766,"user_tz":-180,"elapsed":453,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["+---+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+\n","| ID| text| text_deidentified|\n","+---+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+\n","| 0|EMPLOYMENT AGREEMENT, effective as of June 1, 2013 between Synergy Resources Corporation, a Colorado corpo...|CONTRACT, effective as of 1st June, 2023 between John Snow Labs, a Colorado corporation (the \"Party 2\"), and...|\n","| 0|This First Amendment (Amendment) to the Employment Agreement between Service 1st Bank located in Stockton, 
California...|This First Amendment (Amendment) to the CONTRACT between Amazon INC located in Stockton, California (Party 1) and Ama...|\n","+---+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+\n","\n"]}],"source":["res.show(truncate=120)"]},{"cell_type":"markdown","id":"0b123c35-84ab-45be-9137-46a63e44791e","metadata":{"id":"0b123c35-84ab-45be-9137-46a63e44791e"},"source":["### obfuscate_ref_source=faker \n","You can also use our internal faker library, which has its own predefined vocabulary (for ORG, DOCUMENT TYPES, etc.).\n","\n","However, some entities may not be supported by faker, as the number of models in the Financial NLP library increases. If so, you will just see .\n","\n","In that case, please fall back to a mixed or file-only approach."]},{"cell_type":"markdown","id":"b40a41d1-3472-40db-99e1-9430aa3d4fbf","metadata":{"id":"b40a41d1-3472-40db-99e1-9430aa3d4fbf"},"source":["### obfuscate_ref_source=both\n","This option uses both the internal faker library and the reference file. 
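
The reference file passed through `obfuscate_ref_file_path` holds one `value#LABEL` entry per line, as in the `obfuscation.txt` cell earlier in this notebook. A minimal sketch of generating such entries from a Python dict follows; `build_reference_lines` and `write_reference_file` are hypothetical helpers written for illustration and are not part of the `johnsnowlabs` API, and the check that rejects a `#` inside a value is an assumption about the file format.

```python
# Hypothetical helper (not part of johnsnowlabs): turn a dict of
# {entity_label: [replacement values]} into 'value#LABEL' lines,
# the layout used by obfuscation.txt earlier in this notebook.
def build_reference_lines(replacements):
    lines = []
    for label, values in replacements.items():
        for value in values:
            # Assumption: '#' is the separator, so a value containing it
            # would be ambiguous and is rejected here.
            if '#' in value:
                raise ValueError('replacement values must not contain #')
            lines.append(f'{value}#{label}')
    return lines


# Hypothetical helper: write one entry per line and return the path.
def write_reference_file(replacements, path='obfuscation.txt'):
    with open(path, 'w') as handle:
        for line in build_reference_lines(replacements):
            print(line, file=handle)
    return path


replacements = {
    'PARTY': ['John Snow Labs', 'Amazon INC'],
    'ALIAS': ['Party 1', 'Party 2'],
}
reference_lines = build_reference_lines(replacements)
print(reference_lines)
```

Calling `write_reference_file(replacements)` would then produce a file that can be passed as `obfuscate_ref_file_path`, keeping the replacement vocabulary in one place when several labels are involved.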
"]},{"cell_type":"code","execution_count":27,"id":"67ae99c3-5aeb-4631-8780-da489e53a70e","metadata":{"id":"67ae99c3-5aeb-4631-8780-da489e53a70e","colab":{"base_uri":"https://localhost:8080/"},"outputId":"2a9a5a79-3d37-4184-9398-d28294860523","executionInfo":{"status":"ok","timestamp":1685531832577,"user_tz":-180,"elapsed":5814,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["Deidentification process of the 'text' field has begun...\n","Deidentification process of the 'text' field was completed...\n","Deidentifcation successfully completed and the results saved as 'deidentified.csv' !\n"]}],"source":["deid_implementor = legal.Deid(spark,\n"," input_file_path=\"deid_obfs_data.csv\",\n"," output_file_path=\"deidentified.csv\",\n"," custom_pipeline=model,\n"," fields={\"text\": \"obfuscate\"}, obfuscate_ref_source=\"both\",\n"," obfuscate_ref_file_path=\"obfuscation.txt\")\n","\n","res = deid_implementor.deidentify()"]},{"cell_type":"code","execution_count":28,"id":"e339d615-08a9-4507-88d3-cf712d101821","metadata":{"id":"e339d615-08a9-4507-88d3-cf712d101821","colab":{"base_uri":"https://localhost:8080/"},"outputId":"8dfc1cfd-6a68-4548-f2d6-13ef6983f1e1","executionInfo":{"status":"ok","timestamp":1685531833016,"user_tz":-180,"elapsed":455,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["+---+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+\n","| ID| text| 
text_deidentified|\n","+---+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+\n","| 0|EMPLOYMENT AGREEMENT, effective as of June 1, 2013 between Synergy Resources Corporation, a Colorado corpo...|CONTRACT, effective as of 1st June, 2023 between Amazon INC, a Colorado corporation (the \"Party 2\"), and Joh...|\n","| 0|This First Amendment (Amendment) to the Employment Agreement between Service 1st Bank located in Stockton, California...|This First Amendment (Amendment) to the PRIVATE AGREEMENT between John Snow Labs located in Stockton, California (Par...|\n","+---+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+\n","\n"]}],"source":["res.show(truncate=120)"]},{"cell_type":"markdown","source":["## Using Date Matcher to normalize the dates from text to date format"],"metadata":{"id":"BL8sIoUQH95t"},"id":"BL8sIoUQH95t"},{"cell_type":"code","source":["import pandas as pd\n","data = pd.DataFrame(\n"," {'clientID' : ['A001', 'A001', 'A002'],\n"," 'text' : ['EMPLOYMENT AGREEMENT, effective as of June 1, 2013', \n"," 'This First Amendment adopted, effective as of August 21, 2008', \n"," 'Amendment to the Employment Agreement between Service 1st Bank located in Stockton, California (Bank) and John E. 
Smith (Executive) by 01/06/2023'\n"," ]\n"," }\n",")\n","\n","my_input_df = spark.createDataFrame(data)\n","\n","my_input_df.show(truncate = False)"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"Gp6k45l2H9hc","outputId":"eea7ab55-c5c5-4f64-8286-9067efad6e11","executionInfo":{"status":"ok","timestamp":1685531834212,"user_tz":-180,"elapsed":1199,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"id":"Gp6k45l2H9hc","execution_count":29,"outputs":[{"output_type":"stream","name":"stdout","text":["+--------+-------------------------------------------------------------------------------------------------------------------------------------------------+\n","|clientID|text |\n","+--------+-------------------------------------------------------------------------------------------------------------------------------------------------+\n","|A001 |EMPLOYMENT AGREEMENT, effective as of June 1, 2013 |\n","|A001 |This First Amendment adopted, effective as of August 21, 2008 |\n","|A002 |Amendment to the Employment Agreement between Service 1st Bank located in Stockton, California (Bank) and John E. 
Smith (Executive) by 01/06/2023|\n","+--------+-------------------------------------------------------------------------------------------------------------------------------------------------+\n","\n"]}]},{"cell_type":"code","source":["data"],"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":143},"id":"zYwyJv2uICws","outputId":"88d71345-ccd3-4003-d383-fafa05c32e48","executionInfo":{"status":"ok","timestamp":1685531834212,"user_tz":-180,"elapsed":7,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"id":"zYwyJv2uICws","execution_count":30,"outputs":[{"output_type":"execute_result","data":{"text/plain":[" clientID text\n","0 A001 EMPLOYMENT AGREEMENT, effective as of June 1, ...\n","1 A001 This First Amendment adopted, effective as of ...\n","2 A002 Amendment to the Employment Agreement between ..."],"text/html":["\n","
\n","
\n","
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
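<table>\n","<thead><tr><th></th><th>clientID</th><th>text</th></tr></thead>\n","<tbody>\n","<tr><th>0</th><td>A001</td><td>EMPLOYMENT AGREEMENT, effective as of June 1, ...</td></tr>\n","<tr><th>1</th><td>A001</td><td>This First Amendment adopted, effective as of ...</td></tr>\n","<tr><th>2</th><td>A002</td><td>Amendment to the Employment Agreement between ...</td></tr>\n","</tbody>\n","</table>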
\n","
\n"," \n"," \n"," \n","\n"," \n","
\n","
\n"," "]},"metadata":{},"execution_count":30}]},{"cell_type":"code","source":["documentAssembler = nlp.DocumentAssembler() \\\n"," .setInputCol(\"text\") \\\n"," .setOutputCol(\"document\")\n","\n","date = nlp.DateMatcher() \\\n"," .setInputCols(\"document\") \\\n"," .setOutputCol(\"date\") \\\n"," .setOutputFormat(\"dd/MM/yyyy\")\n"," #.setAnchorDateYear(2020) \\ Use these if you want to stick with a specific month and range of days\n"," #.setAnchorDateMonth(1) \\\n"," #.setAnchorDateDay(11) \\\n","\n","pipeline = nlp.Pipeline().setStages([\n"," documentAssembler,\n"," date\n","])\n","\n","result = pipeline.fit(my_input_df).transform(my_input_df)"],"metadata":{"id":"80tKw5iJIE_P","executionInfo":{"status":"ok","timestamp":1685531834572,"user_tz":-180,"elapsed":2,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"id":"80tKw5iJIE_P","execution_count":31,"outputs":[]},{"cell_type":"code","source":["result.select(\"clientID\", \"text\", \"date.begin\", \"date.end\", \"date.result\").show(truncate=False)"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"f3En8TNNIgP8","outputId":"ef56d8db-ff57-4168-fd3a-4e313e5090fd","executionInfo":{"status":"ok","timestamp":1685531834572,"user_tz":-180,"elapsed":2,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"id":"f3En8TNNIgP8","execution_count":32,"outputs":[{"output_type":"stream","name":"stdout","text":["+--------+-------------------------------------------------------------------------------------------------------------------------------------------------+-----+-----+------------+\n","|clientID|text |begin|end |result |\n","+--------+-------------------------------------------------------------------------------------------------------------------------------------------------+-----+-----+------------+\n","|A001 |EMPLOYMENT AGREEMENT, effective as of June 1, 2013 |[38] |[49] |[01/06/2013]|\n","|A001 |This First Amendment adopted, effective as of August 21, 2008 
|[46] |[60] |[21/08/2008]|\n","|A002 |Amendment to the Employment Agreement between Service 1st Bank located in Stockton, California (Bank) and John E. Smith (Executive) by 01/06/2023|[135]|[144]|[06/01/2023]|\n","+--------+-------------------------------------------------------------------------------------------------------------------------------------------------+-----+-----+------------+\n","\n"]}]},{"cell_type":"markdown","source":["`begin` and `end` are character offsets into `text`, and `end` is inclusive. The matched date is therefore `text[begin:end + 1]`, and `result` holds its normalized form. You can use these values to rewrite the dates in your original strings."],"metadata":{"id":"sIDhj8znJJw7"},"id":"sIDhj8znJJw7"},{"cell_type":"markdown","id":"88c297af-e57a-4199-ac05-0665e7f9dc78","metadata":{"id":"88c297af-e57a-4199-ac05-0665e7f9dc78"},"source":["### shifting days according to the ID column\n","The dates are normalized first, as done above with `DateMatcher`. The sample texts below already contain the normalized dates."]},{"cell_type":"code","execution_count":33,"id":"fe41fd52-1ec3-4cf4-a0c2-75333ec53ea2","metadata":{"id":"fe41fd52-1ec3-4cf4-a0c2-75333ec53ea2","colab":{"base_uri":"https://localhost:8080/"},"outputId":"e6bb83dc-3041-46c0-e016-7a7256e62e7c","executionInfo":{"status":"ok","timestamp":1685531834921,"user_tz":-180,"elapsed":351,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["+--------+-------------------------------------------------------------------------------------------------------------------------------------------------+\n","|clientID|text |\n","+--------+-------------------------------------------------------------------------------------------------------------------------------------------------+\n","|A001 |EMPLOYMENT AGREEMENT, effective as of 01/06/2013 |\n","|A001 |This First Amendment adopted, effective as of 21/08/2008 |\n","|A002 |Amendment to the Employment Agreement between Service 1st Bank located in Stockton, California (Bank) and John E. 
Smith (Executive) by 06/01/2023|\n","+--------+-------------------------------------------------------------------------------------------------------------------------------------------------+\n","\n"]}],"source":["import pandas as pd\n","data = pd.DataFrame(\n"," {'clientID' : ['A001', 'A001', 'A002'],\n"," 'text' : ['EMPLOYMENT AGREEMENT, effective as of 01/06/2013', \n"," 'This First Amendment adopted, effective as of 21/08/2008', \n"," 'Amendment to the Employment Agreement between Service 1st Bank located in Stockton, California (Bank) and John E. Smith (Executive) by 06/01/2023'\n"," ]\n"," }\n",")\n","\n","my_input_df = spark.createDataFrame(data)\n","\n","my_input_df.show(truncate = False)"]},{"cell_type":"code","execution_count":34,"id":"cf4af6b3-82e9-4194-b04c-a2b9f81b7cfe","metadata":{"id":"cf4af6b3-82e9-4194-b04c-a2b9f81b7cfe","executionInfo":{"status":"ok","timestamp":1685531835229,"user_tz":-180,"elapsed":311,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[],"source":["df_pd = my_input_df.toPandas()\n","df_pd.to_csv(\"deid_id_data.csv\", index=False)"]},{"cell_type":"markdown","source":["Custom pipeline with `DocumentHashCoder()`. 
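To build intuition for what a per-ID date shift means, here is a small, self-contained Python sketch. It is only an illustration under stated assumptions, not the algorithm `DocumentHashCoder` actually uses: the helper names `shift_for_id` and `shift_date`, the SHA-256 hashing, and the symmetric offset range are all hypothetical choices made for this example. The idea it shows is that every row sharing a `clientID` receives the same day offset, so the intervals between one client's dates are preserved.

```python
import hashlib
from datetime import datetime, timedelta


def shift_for_id(client_id, seed=100, range_days=100):
    # Hypothetical helper: derive a repeatable offset in [-range_days, range_days]
    # from the ID and a seed. This is NOT the library's internal algorithm.
    digest = hashlib.sha256(f'{seed}:{client_id}'.encode('utf-8')).hexdigest()
    return int(digest, 16) % (2 * range_days + 1) - range_days


def shift_date(date_text, days, fmt='%d/%m/%Y'):
    # Parse a normalized date, move it by the given number of days, and format it back.
    shifted = datetime.strptime(date_text, fmt) + timedelta(days=days)
    return shifted.strftime(fmt)


rows = [('A001', '01/06/2013'), ('A001', '21/08/2008'), ('A002', '06/01/2023')]
for client_id, date_text in rows:
    days = shift_for_id(client_id)
    print(client_id, date_text, days, shift_date(date_text, days))
```

Both `A001` rows print the same offset because, in this sketch, the offset depends only on the ID and the seed. In the pipeline that follows, `setPatientIdColumn`, `setSeed` and `setRangeDays` appear to play the corresponding roles.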
"],"metadata":{"id":"W4zBHBsuZT1I"},"id":"W4zBHBsuZT1I"},{"cell_type":"code","execution_count":35,"id":"bafbafc6-fec9-47f2-9184-b4a0c55918b3","metadata":{"scrolled":true,"tags":[],"id":"bafbafc6-fec9-47f2-9184-b4a0c55918b3","outputId":"50fe9fb6-aa22-4193-ea31-3a33c42d844c","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1685531848251,"user_tz":-180,"elapsed":13024,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["roberta_embeddings_legal_roberta_base download started this may take some time.\n","Approximate size to download 447.2 MB\n","[OK!]\n","legner_deid download started this may take some time.\n","[OK!]\n"]}],"source":["documentAssembler = nlp.DocumentAssembler()\\\n"," .setInputCol(\"text\")\\\n"," .setOutputCol(\"document\")\n","\n","documentHasher = legal.DocumentHashCoder()\\\n"," .setInputCols(\"document\")\\\n"," .setOutputCol(\"document2\")\\\n"," .setRangeDays(100)\\\n"," .setNewDateShift(\"shift_days\")\\\n"," .setPatientIdColumn(\"clientID\")\\\n"," .setSeed(100)\n","\n","tokenizer = nlp.Tokenizer()\\\n"," .setInputCols([\"document2\"])\\\n"," .setOutputCol(\"token\")\n","\n","embeddings = nlp.RoBertaEmbeddings.pretrained(\"roberta_embeddings_legal_roberta_base\",\"en\") \\\n"," .setInputCols([\"document2\", \"token\"]) \\\n"," .setOutputCol(\"embeddings\")\n","\n","ner_model = legal.NerModel.pretrained('legner_deid', \"en\", \"legal/models\")\\\n"," .setInputCols([\"document2\", \"token\", \"embeddings\"])\\\n"," .setOutputCol(\"ner\")\n","\n","ner_converter = nlp.NerConverter()\\\n"," .setInputCols([\"document2\",\"token\",\"ner\"])\\\n"," .setOutputCol(\"ner_chunk\")\n"," \n","nlpPipeline = nlp.Pipeline().setStages([\n"," documentAssembler,\n"," documentHasher,\n"," tokenizer,\n"," embeddings,\n"," ner_model,\n"," ner_converter])\n","\n","empty_data = spark.createDataFrame([[\"\", \"\"]]).toDF(\"text\", 
\"clientID\")\n","\n","pipeline_model = nlpPipeline.fit(empty_data)"]},{"cell_type":"code","execution_count":36,"id":"477aca34-b392-4dae-946a-4e59e21c6240","metadata":{"scrolled":true,"tags":[],"id":"477aca34-b392-4dae-946a-4e59e21c6240","outputId":"5fd711c2-b6e7-4df5-8454-7f636b9c38a2","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1685531859675,"user_tz":-180,"elapsed":11430,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["Deidentification process of the 'text' field has begun...\n","Deidentification process of the 'text' field was completed...\n","Deidentifcation successfully completed and the results saved as 'deidentified.csv' !\n"]}],"source":["deid_implementor = legal.Deid(spark,\n"," input_file_path=\"deid_id_data.csv\",\n"," output_file_path=\"deidentified.csv\",\n"," custom_pipeline=pipeline_model,\n"," fields={\"text\": \"obfuscate\"},\n"," shift_days=True,\n"," obfuscate_date=True, \n"," ner_chunk=\"ner_chunk\",\n"," token=\"token\",\n"," documenthashcoder_col_name=\"document2\",\n"," separator=\",\",\n"," unnormalized_date=False)\n","\n","res = deid_implementor.deidentify()"]},{"cell_type":"code","execution_count":37,"id":"506b4f84-972e-4efb-8ecc-963017ba1a5a","metadata":{"id":"506b4f84-972e-4efb-8ecc-963017ba1a5a","colab":{"base_uri":"https://localhost:8080/"},"outputId":"a64ccd3b-4d47-4b57-8395-406425adf1ff","executionInfo":{"status":"ok","timestamp":1685531861037,"user_tz":-180,"elapsed":1364,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["+---+-------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------+\n","|ID |text 
|text_deidentified |\n","+---+-------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------+\n","|0 |EMPLOYMENT AGREEMENT, effective as of 01/06/2013 |EMPLOYMENT AGREEMENT, effective as of 01/01/2013 |\n","|1 |This First Amendment adopted, effective as of 21/08/2008 |This First Amendment adopted, effective as of 16/08/2008 |\n","|2 |Amendment to the Employment Agreement between Service 1st Bank located in Stockton, California (Bank) and John E. Smith (Executive) by 06/01/2023|Amendment to the Employment Agreement between located in Conway, Oklahoma (Bank) and (Executive) by 06/30/2023|\n","+---+-------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------+\n","\n"]}],"source":["res.show(truncate=False)"]},{"cell_type":"markdown","id":"368b319b-6a86-4e8b-8ecf-2738f7fa1b94","metadata":{"id":"368b319b-6a86-4e8b-8ecf-2738f7fa1b94"},"source":["### shifting days according to specified values: XX/XX/XXXX or textual formats: June 10th, 2023"]},{"cell_type":"code","execution_count":38,"id":"192051fa-e915-4cd2-8ce3-e65e9c0f23e7","metadata":{"id":"192051fa-e915-4cd2-8ce3-e65e9c0f23e7","colab":{"base_uri":"https://localhost:8080/"},"outputId":"f13e46d7-7764-4b56-fa9c-ec8161db20b2","executionInfo":{"status":"ok","timestamp":1685531863209,"user_tz":-180,"elapsed":2175,"user":{"displayName":"Bünyamin 
Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["+--------+-------------------------------------------------------------------------------------------------------------------------------------------------+---------+\n","|clientID|text |dateshift|\n","+--------+-------------------------------------------------------------------------------------------------------------------------------------------------+---------+\n","|A001 |EMPLOYMENT AGREEMENT, effective as of 10 June 2013 |10 |\n","|A001 |This First Amendment adopted, effective as of August 8th, 2008 |-2 |\n","|A002 |Amendment to the Employment Agreement between Service 1st Bank located in Stockton, California (Bank) and John E. Smith (Executive) by 06/01/2023|30 |\n","+--------+-------------------------------------------------------------------------------------------------------------------------------------------------+---------+\n","\n"]}],"source":["data = pd.DataFrame(\n"," {'clientID' : ['A001', 'A001', 'A002'],\n"," 'text' : ['EMPLOYMENT AGREEMENT, effective as of 10 June 2013', \n"," 'This First Amendment adopted, effective as of August 8th, 2008', \n"," 'Amendment to the Employment Agreement between Service 1st Bank located in Stockton, California (Bank) and John E. 
Smith (Executive) by 06/01/2023'\n"," ],\n"," 'dateshift' : ['10', '-2', '30']\n"," }\n",")\n","\n","my_input_df = spark.createDataFrame(data)\n","\n","my_input_df.show(truncate=False)"]},{"cell_type":"code","execution_count":39,"id":"571108db-7c92-477a-9dc2-78b56fb147d7","metadata":{"id":"571108db-7c92-477a-9dc2-78b56fb147d7","executionInfo":{"status":"ok","timestamp":1685531863210,"user_tz":-180,"elapsed":7,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[],"source":["df_pd= my_input_df.toPandas()\n","df_pd.to_csv(\"deid_specific_data.csv\", index=False)"]},{"cell_type":"code","execution_count":40,"id":"56fddf76-6c5c-48a6-b2b3-da2a7636d841","metadata":{"scrolled":true,"tags":[],"id":"56fddf76-6c5c-48a6-b2b3-da2a7636d841","colab":{"base_uri":"https://localhost:8080/"},"outputId":"0c31676f-630f-4a5d-b555-c71289df1681","executionInfo":{"status":"ok","timestamp":1685531870938,"user_tz":-180,"elapsed":7733,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["roberta_embeddings_legal_roberta_base download started this may take some time.\n","Approximate size to download 447.2 MB\n","[OK!]\n","legner_deid download started this may take some time.\n","[OK!]\n"]}],"source":["documentAssembler = nlp.DocumentAssembler()\\\n"," .setInputCol(\"text\")\\\n"," .setOutputCol(\"document\")\n","\n","documentHasher = legal.DocumentHashCoder()\\\n"," .setInputCols(\"document\")\\\n"," .setOutputCol(\"document2\")\\\n"," .setDateShiftColumn(\"dateshift\")\\\n","\n","tokenizer = nlp.Tokenizer()\\\n"," .setInputCols([\"document2\"])\\\n"," .setOutputCol(\"token\")\n","\n","embeddings = nlp.RoBertaEmbeddings.pretrained(\"roberta_embeddings_legal_roberta_base\",\"en\") \\\n"," .setInputCols([\"document2\", \"token\"]) \\\n"," .setOutputCol(\"embeddings\")\n","\n","ner_model = legal.NerModel.pretrained('legner_deid', \"en\", \"legal/models\")\\\n"," 
.setInputCols([\"document2\", \"token\", \"embeddings\"])\\\n"," .setOutputCol(\"ner\")\n","\n","ner_converter = nlp.NerConverter()\\\n"," .setInputCols([\"document2\",\"token\",\"ner\"])\\\n"," .setOutputCol(\"ner_chunk\")\n"," \n","nlpPipeline = nlp.Pipeline().setStages([\n"," documentAssembler,\n"," documentHasher,\n"," tokenizer,\n"," embeddings,\n"," ner_model,\n"," ner_converter])\n","\n","empty_data = spark.createDataFrame([[\"\", \"\", \"\"]]).toDF(\"clientID\",\"text\", \"dateshift\")\n","\n","pipeline_col_model = nlpPipeline.fit(empty_data)"]},{"cell_type":"code","execution_count":41,"id":"a604518f-8149-46ea-8a1d-2d6cbc92a74e","metadata":{"scrolled":true,"tags":[],"id":"a604518f-8149-46ea-8a1d-2d6cbc92a74e","colab":{"base_uri":"https://localhost:8080/"},"outputId":"ec5cc22b-9e23-47e4-820a-8eaedfb14c53","executionInfo":{"status":"ok","timestamp":1685531876322,"user_tz":-180,"elapsed":5398,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["Deidentification process of the 'text' field has begun...\n","Deidentification process of the 'text' field was completed...\n","Deidentifcation successfully completed and the results saved as 'deid_specific_data.csv' !\n"]}],"source":["deid_implementor = legal.Deid(spark,\n"," input_file_path=\"deid_specific_data.csv\",\n"," separator=\",\",\n"," output_file_path=\"deid_specific_data.csv\",\n"," custom_pipeline=pipeline_col_model,\n"," fields={\"text\": \"obfuscate\"},\n"," shift_days=True,\n"," obfuscate_date=True, \n"," ner_chunk=\"ner_chunk\",\n"," token=\"token\",\n"," documenthashcoder_col_name=\"document2\")\n","\n","res = 
deid_implementor.deidentify()"]},{"cell_type":"code","execution_count":42,"id":"32fe94b9-12fb-4a16-ba20-090a827e3c58","metadata":{"id":"32fe94b9-12fb-4a16-ba20-090a827e3c58","colab":{"base_uri":"https://localhost:8080/"},"outputId":"2da9e424-f920-4d4d-990f-2965a37526dd","executionInfo":{"status":"ok","timestamp":1685531876771,"user_tz":-180,"elapsed":465,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["+---+-------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+\n","|ID |text |text_deidentified |\n","+---+-------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+\n","|0 |EMPLOYMENT AGREEMENT, effective as of 10 June 2013 |EMPLOYMENT AGREEMENT, effective as of 20 June 2013 |\n","|1 |This First Amendment adopted, effective as of August 8th, 2008 |This First Amendment adopted, effective as of August 6th, 2008 |\n","|2 |Amendment to the Employment Agreement between Service 1st Bank located in Stockton, California (Bank) and John E. 
Smith (Executive) by 06/01/2023|Amendment to the Employment Agreement between located in Oakland, South Carolina (Bank) and (Executive) by 07/01/2023|\n","+---+-------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+\n","\n"]}],"source":["res.show(truncate=False)"]},{"cell_type":"markdown","id":"3df91aeb-681f-468e-ac67-67020093959e","metadata":{"id":"3df91aeb-681f-468e-ac67-67020093959e"},"source":["### unnormalized date formats"]},{"cell_type":"code","execution_count":43,"id":"12edc84b-f9e7-4bc9-b92f-b8785f84d012","metadata":{"id":"12edc84b-f9e7-4bc9-b92f-b8785f84d012","colab":{"base_uri":"https://localhost:8080/"},"outputId":"07155038-13f3-4269-818b-4c7fa221ec76","executionInfo":{"status":"ok","timestamp":1685531876772,"user_tz":-180,"elapsed":6,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["+--------+-------------------------------------------------------------------------------------------------------------------------------------------------+---------+\n","|clientID|text |dateshift|\n","+--------+-------------------------------------------------------------------------------------------------------------------------------------------------+---------+\n","|A001 |EMPLOYMENT AGREEMENT, effective as of 3May2002 |10 |\n","|A001 |This First Amendment adopted, effective as of Agust 8th, 2008 |-2 |\n","|A002 |Amendment to the Employment Agreement between Service 1st Bank located in Stockton, California (Bank) and John E. 
Smith (Executive) by 06/01/2023|30 |\n","+--------+-------------------------------------------------------------------------------------------------------------------------------------------------+---------+\n","\n"]}],"source":["import pandas as pd\n","\n","data = pd.DataFrame(\n"," {'clientID' : ['A001', 'A001', 'A002'],\n"," 'text' : ['EMPLOYMENT AGREEMENT, effective as of 3May2002', \n"," 'This First Amendment adopted, effective as of Agust 8th, 2008', \n"," 'Amendment to the Employment Agreement between Service 1st Bank located in Stockton, California (Bank) and John E. Smith (Executive) by 06/01/2023'\n"," ],\n"," 'dateshift' : ['10', '-2', '30']\n"," }\n",")\n","\n","\n","my_input_df = spark.createDataFrame(data)\n","\n","my_input_df.show(truncate=False)"]},{"cell_type":"code","execution_count":44,"id":"a186c14e-ce61-4f7e-a09d-5f8a46eb41e7","metadata":{"id":"a186c14e-ce61-4f7e-a09d-5f8a46eb41e7","executionInfo":{"status":"ok","timestamp":1685531876772,"user_tz":-180,"elapsed":3,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[],"source":["df_pd = my_input_df.toPandas()\n","df_pd.to_csv(\"deid_unnormalized_data.csv\", index=False)"]},{"cell_type":"code","execution_count":45,"id":"f2ba23c4-5cf7-444b-b253-e2f8163b75e1","metadata":{"scrolled":true,"tags":[],"id":"f2ba23c4-5cf7-444b-b253-e2f8163b75e1","colab":{"base_uri":"https://localhost:8080/"},"outputId":"541ac12d-031c-4fa9-f545-1dff1a6c6030","executionInfo":{"status":"ok","timestamp":1685531880955,"user_tz":-180,"elapsed":4186,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["Deidentification process of the 'text' field has begun...\n","Deidentification process of the 'text' field was completed...\n","Deidentifcation successfully completed and the results saved as 'deidentified.csv' !\n"]}],"source":["deid_implementor = legal.Deid(spark,\n"," 
input_file_path=\"deid_unnormalized_data.csv\",\n"," output_file_path=\"deidentified.csv\",\n"," custom_pipeline=pipeline_col_model,\n"," fields={\"text\": \"obfuscate\"},\n"," shift_days=True,\n"," obfuscate_date=True, \n"," ner_chunk=\"ner_chunk\",\n"," token=\"token\",\n"," documenthashcoder_col_name=\"document2\",\n"," separator=\",\",\n"," unnormalized_date=True,\n"," unnormalized_mode=\"mask\"\n"," )\n","\n","res = deid_implementor.deidentify()"]},{"cell_type":"code","execution_count":46,"id":"2cda1462-9605-44f0-9d66-8efa761679b2","metadata":{"id":"2cda1462-9605-44f0-9d66-8efa761679b2","colab":{"base_uri":"https://localhost:8080/"},"outputId":"4ff1c699-38b3-4ba9-e899-48aadd0d5b67","executionInfo":{"status":"ok","timestamp":1685531880955,"user_tz":-180,"elapsed":3,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["+---+-------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+\n","|ID |text |text_deidentified |\n","+---+-------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+\n","|0 |EMPLOYMENT AGREEMENT, effective as of 3May2002 |EMPLOYMENT AGREEMENT, effective as of |\n","|1 |This First Amendment adopted, effective as of Agust 8th, 2008 |This First Amendment adopted, effective as of |\n","|2 |Amendment to the Employment Agreement between Service 1st Bank located in Stockton, California (Bank) and John E. 
Smith (Executive) by 06/01/2023|Amendment to the Employment Agreement between located in Toruń, Wyoming (Bank) and (Executive) by 07/01/2023|\n","+---+-------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+\n","\n"]}],"source":["res.show(truncate=False)"]},{"cell_type":"markdown","id":"73a4ff14-b5eb-4e48-8659-4c0be03a2318","metadata":{"id":"73a4ff14-b5eb-4e48-8659-4c0be03a2318"},"source":["**unnormalized_mode=\"obfuscate\"**"]},{"cell_type":"code","execution_count":47,"id":"f79238e6-d0a6-4add-894f-4d1d75e6460f","metadata":{"id":"f79238e6-d0a6-4add-894f-4d1d75e6460f","colab":{"base_uri":"https://localhost:8080/"},"outputId":"622f3c37-f451-481e-81df-4b8c57ef92c8","executionInfo":{"status":"ok","timestamp":1685531883661,"user_tz":-180,"elapsed":2707,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["Deidentification process of the 'text' field has begun...\n","Deidentification process of the 'text' field was completed...\n","Deidentifcation successfully completed and the results saved as 'deidentified1.csv' !\n"]}],"source":["deid_implementor = legal.Deid(spark,\n"," input_file_path=\"deid_unnormalized_data.csv\",\n"," output_file_path=\"deidentified1.csv\",\n"," custom_pipeline=pipeline_col_model,\n"," fields={\"text\": \"obfuscate\"},\n"," shift_days=True,\n"," obfuscate_date=True, \n"," ner_chunk=\"ner_chunk\",\n"," token=\"token\",\n"," documenthashcoder_col_name=\"document2\",\n"," separator=\",\",\n"," unnormalized_date=True,\n"," unnormalized_mode=\"obfuscate\"\n"," )\n","\n","res = 
deid_implementor.deidentify()"]},{"cell_type":"code","execution_count":48,"id":"82fa23cd-85fc-4351-b116-40dcab242fe9","metadata":{"id":"82fa23cd-85fc-4351-b116-40dcab242fe9","colab":{"base_uri":"https://localhost:8080/"},"outputId":"f3e367da-6c25-4ce1-ced5-59af7a747a2f","executionInfo":{"status":"ok","timestamp":1685531884295,"user_tz":-180,"elapsed":649,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["+---+-------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+\n","|ID |text |text_deidentified |\n","+---+-------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+\n","|0 |EMPLOYMENT AGREEMENT, effective as of 3May2002 |EMPLOYMENT AGREEMENT, effective as of 04-04-1987 |\n","|1 |This First Amendment adopted, effective as of Agust 8th, 2008 |This First Amendment adopted, effective as of 11-03-2000 |\n","|2 |Amendment to the Employment Agreement between Service 1st Bank located in Stockton, California (Bank) and John E. 
Smith (Executive) by 06/01/2023|Amendment to the Employment Agreement between located in Cushing, Iowa (Bank) and (Executive) by 07/01/2023|\n","+---+-------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+\n","\n"]}],"source":["res.show(truncate=False)"]},{"cell_type":"markdown","source":["# Default pipeline for the Legal domain\n","This section uses the default pipeline that `legal.Deid` provides for `domain=\"legal\"` instead of a custom one. The default pipeline does not mask DOCUMENT TYPE entities."],"metadata":{"id":"yuU6fMFIEFN-"},"id":"yuU6fMFIEFN-"},{"cell_type":"code","source":["deid_implementor = legal.Deid(spark,\n"," ner_chunk=\"merged_ner_chunks\",\n"," input_file_path=\"deid_data.csv\",\n"," output_file_path=\"deidentified_custompipe.csv\",\n"," domain=\"legal\")\n","\n","# Run the default pipeline; without this call, `res` would still hold the previous section's result.\n","res = deid_implementor.deidentify()"],"metadata":{"id":"1FKxTQirK3bP","executionInfo":{"status":"ok","timestamp":1685531884296,"user_tz":-180,"elapsed":10,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"id":"1FKxTQirK3bP","execution_count":49,"outputs":[]},{"cell_type":"code","source":["res.show(truncate=False)"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"Fy1AZenYlMLR","outputId":"24f468c3-3f7f-48e7-c1c6-cf1dcab62290","executionInfo":{"status":"ok","timestamp":1685531884296,"user_tz":-180,"elapsed":9,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"id":"Fy1AZenYlMLR","execution_count":50,"outputs":[{"output_type":"stream","name":"stdout","text":["+---+-------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+\n","|ID |text |text_deidentified 
|\n","+---+-------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+\n","|0 |EMPLOYMENT AGREEMENT, effective as of 3May2002 |EMPLOYMENT AGREEMENT, effective as of 04-04-1987 |\n","|1 |This First Amendment adopted, effective as of Agust 8th, 2008 |This First Amendment adopted, effective as of 11-03-2000 |\n","|2 |Amendment to the Employment Agreement between Service 1st Bank located in Stockton, California (Bank) and John E. Smith (Executive) by 06/01/2023|Amendment to the Employment Agreement between located in Cushing, Iowa (Bank) and (Executive) by 07/01/2023|\n","+---+-------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+\n","\n"]}]},{"cell_type":"markdown","id":"aa5f5af0-ad93-4542-b86c-85128718a374","metadata":{"id":"aa5f5af0-ad93-4542-b86c-85128718a374"},"source":["# Structured Deidentification"]},{"cell_type":"code","source":["# sample data\n","! 
wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/hipaa-table-001.txt\n","\n","df = spark.read.format(\"csv\") \\\n"," .option(\"sep\", \"\\t\") \\\n"," .option(\"inferSchema\", \"true\") \\\n"," .option(\"header\", \"true\") \\\n"," .load(\"hipaa-table-001.txt\")\n","\n","df.show(truncate=False)"],"metadata":{"id":"GBf31JxQFlQN","colab":{"base_uri":"https://localhost:8080/"},"outputId":"0d5c2765-a209-4f8f-915f-b0aac2468639","executionInfo":{"status":"ok","timestamp":1685536190877,"user_tz":-180,"elapsed":411,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"id":"GBf31JxQFlQN","execution_count":76,"outputs":[{"output_type":"stream","name":"stdout","text":["+---------------+----------+---+----------------------------------------------------+-------+--------------+---+---+\n","|NAME |DOB |AGE|ADDRESS |ZIPCODE|TEL |SBP|DBP|\n","+---------------+----------+---+----------------------------------------------------+-------+--------------+---+---+\n","|Cecilia Chapman|04/02/1935|83 |711-2880 Nulla St. Mankato Mississippi |69200 |(257) 563-7401|101|42 |\n","|Iris Watson |03/10/2009|9 |P.O. Box 283 8562 Fusce Rd. Frederick Nebraska |20620 |(372) 587-2335|159|122|\n","|Bryar Pitts |11/01/1921|98 |5543 Aliquet St. Fort Dodge GA |20783 |(717) 450-4729|149|52 |\n","|Theodore Lowe |13/02/2002|16 |Ap #867-859 Sit Rd. Azusa New York |39531 |(793) 151-6230|134|115|\n","|Calista Wise |20/08/1942|76 |7292 Dictum Av. San Antonio MI |47096 |(492) 709-6392|139|78 |\n","|Kyla Olsen |12/05/1973|45 |Ap #651-8679 Sodales Av. Tamuning PA |10855 |(654) 393-5734|120|112|\n","|Forrest Ray |11/01/1991|27 |191-103 Integer Rd. Corona New Mexico |8219 |(404) 960-3807|143|126|\n","|Hiroko Potter |18/11/1937|81 |P.O. Box 887 2508 Dolor. Av. Muskegon KY |12482 |(314) 244-6306|147|75 |\n","|Celeste Slater |12/05/1980|38 |606-3727 Ullamcorper. 
Street Roseville NH |11523 |(786) 713-8616|147|123|\n","|Nyssa Vazquez |24/09/1956|62 |511-5762 At Rd. Chelsea MI |67708 |(947) 278-5929|129|50 |\n","|Lawrence Moreno|26/12/1906|112|935-9940 Tortor. Street Santa Rosa MN |98804 |(684) 579-1879|133|102|\n","|Ina Moran |26/10/1983|35 |P.O. Box 929 4189 Nunc Road Lebanon KY |69409 |(389) 737-2852|101|67 |\n","|Aaron Hawkins |26/09/2009|9 |5587 Nunc. Avenue Erie Rhode Island |24975 |(660) 663-4518|87 |81 |\n","|Hedy Greene |03/10/1920|98 |Ap #696-3279 Viverra. Avenue Latrobe DE |38100 |(608) 265-2215|128|123|\n","|Melvin Porter |14/08/1911|107|P.O. Box 132 1599 Curabitur Rd. Bandera South Dakota|45149 |(959) 119-8364|83 |43 |\n","|Keefe Sellers |16/05/1937|81 |347-7666 Iaculis St. Woodruff SC |49854 |(468) 353-2641|148|109|\n","|Joan Romero |08/12/2004|14 |666-4366 Lacinia Avenue Idaho Falls Ohio |19253 |(248) 675-4007|75 |53 |\n","|Davis Patrick |09/01/1956|63 |P.O. Box 147 2546 Sociosqu Rd. Bethlehem Utah |2913 |(939) 353-1107|142|62 |\n","|Leilani Boyer |18/10/1934|84 |557-6308 Lacinia Road San Bernardino ND |9289 |(570) 873-7090|137|48 |\n","|Colby Bernard |02/10/1905|113|Ap #285-7193 Ullamcorper Avenue Amesbury HI |93373 |(302) 259-2375|84 |41 |\n","+---------------+----------+---+----------------------------------------------------+-------+--------------+---+---+\n","only showing top 20 rows\n","\n"]}]},{"cell_type":"markdown","id":"bfc841f6-4bc8-4a7c-9cf4-fc32f17afd67","metadata":{"id":"bfc841f6-4bc8-4a7c-9cf4-fc32f17afd67"},"source":["## Default parameters"]},{"cell_type":"code","source":["obfuscator = legal.StructuredDeidentification(spark,{\"NAME\":\"NAME\",\"AGE\":\"AGE\"}, obfuscateRefSource = \"faker\")\n","obfuscator_df = obfuscator.obfuscateColumns(df)\n","obfuscator_df.show(truncate=False)"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"qiVcI_oFh9sK","executionInfo":{"status":"ok","timestamp":1685536194570,"user_tz":-180,"elapsed":460,"user":{"displayName":"Bünyamin 
Polat","userId":"03982086590103784785"}},"outputId":"764199ec-62f2-4f19-dff9-6e6aec7498e4"},"id":"qiVcI_oFh9sK","execution_count":77,"outputs":[{"output_type":"stream","name":"stdout","text":["+--------------------+----------+-----+----------------------------------------------------+-------+--------------+---+---+\n","|NAME |DOB |AGE |ADDRESS |ZIPCODE|TEL |SBP|DBP|\n","+--------------------+----------+-----+----------------------------------------------------+-------+--------------+---+---+\n","|[Natasha Bence] |04/02/1935|[93] |711-2880 Nulla St. Mankato Mississippi |69200 |(257) 563-7401|101|42 |\n","|[Karie Chimera] |03/10/2009|[5] |P.O. Box 283 8562 Fusce Rd. Frederick Nebraska |20620 |(372) 587-2335|159|122|\n","|[Lucita Ferrara] |11/01/1921|[93] |5543 Aliquet St. Fort Dodge GA |20783 |(717) 450-4729|149|52 |\n","|[Lorane Gell] |13/02/2002|[12] |Ap #867-859 Sit Rd. Azusa New York |39531 |(793) 151-6230|134|115|\n","|[Lowell Guitar] |20/08/1942|[73] |7292 Dictum Av. San Antonio MI |47096 |(492) 709-6392|139|78 |\n","|[Jimmey Ralph] |12/05/1973|[46] |Ap #651-8679 Sodales Av. Tamuning PA |10855 |(654) 393-5734|120|112|\n","|[Gwendlyn Deutscher]|11/01/1991|[30] |191-103 Integer Rd. Corona New Mexico |8219 |(404) 960-3807|143|126|\n","|[Doyle Askew] |18/11/1937|[84] |P.O. Box 887 2508 Dolor. Av. Muskegon KY |12482 |(314) 244-6306|147|75 |\n","|[Volanda Napoleon] |12/05/1980|[37] |606-3727 Ullamcorper. Street Roseville NH |11523 |(786) 713-8616|147|123|\n","|[Curt Bears] |24/09/1956|[72] |511-5762 At Rd. Chelsea MI |67708 |(947) 278-5929|129|50 |\n","|[Johna Sheriff] |26/12/1906|[106]|935-9940 Tortor. Street Santa Rosa MN |98804 |(684) 579-1879|133|102|\n","|[Nedra Hai] |26/10/1983|[22] |P.O. Box 929 4189 Nunc Road Lebanon KY |69409 |(389) 737-2852|101|67 |\n","|[Lady Saucier] |26/09/2009|[5] |5587 Nunc. Avenue Erie Rhode Island |24975 |(660) 663-4518|87 |81 |\n","|[Gracy Bruins] |03/10/1920|[93] |Ap #696-3279 Viverra. 
Avenue Latrobe DE |38100 |(608) 265-2215|128|123|\n","|[Lennie Hummer] |14/08/1911|[103]|P.O. Box 132 1599 Curabitur Rd. Bandera South Dakota|45149 |(959) 119-8364|83 |43 |\n","|[Harvel Ricks] |16/05/1937|[84] |347-7666 Iaculis St. Woodruff SC |49854 |(468) 353-2641|148|109|\n","|[Rip Harbour] |08/12/2004|[16] |666-4366 Lacinia Avenue Idaho Falls Ohio |19253 |(248) 675-4007|75 |53 |\n","|[Steve Rattler] |09/01/1956|[68] |P.O. Box 147 2546 Sociosqu Rd. Bethlehem Utah |2913 |(939) 353-1107|142|62 |\n","|[Dell Ponto] |18/10/1934|[82] |557-6308 Lacinia Road San Bernardino ND |9289 |(570) 873-7090|137|48 |\n","|[Jaclyn Prime] |02/10/1905|[118]|Ap #285-7193 Ullamcorper Avenue Amesbury HI |93373 |(302) 259-2375|84 |41 |\n","+--------------------+----------+-----+----------------------------------------------------+-------+--------------+---+---+\n","only showing top 20 rows\n","\n"]}]},{"cell_type":"markdown","id":"aba6745f-a6e5-455a-a4b3-6cd334514242","metadata":{"id":"aba6745f-a6e5-455a-a4b3-6cd334514242"},"source":["## ref_source=File"]},{"cell_type":"code","source":["obfuscator_unique_ref_test = '''Will Perry#NAME\n","John Smith#NAME\n","Marvin MARSHALL#NAME\n","Hubert GROGAN#NAME\n","ALTHEA COLBURN#NAME\n","Kalil AMIN#NAME\n","Inci FOUNTAIN#NAME\n","Jackson WILLE#NAME\n","Jack SANTOS#NAME\n","Mahmood ALBURN#NAME\n","Marnie MELINGTON#NAME\n","Aysha GHAZI#NAME\n","Maryland CODER#NAME\n","Darene GEORGIOUS#NAME\n","Shelly WELLBECK#NAME\n","Min Kun JAE#NAME\n","Thomson THOMAS#NAME\n","Christian SUDDINBURG#NAME\n","20#AGE\n","30#AGE\n","40#AGE\n","50#AGE\n","60#AGE\n","(901)111-2222#TEL\n","(109)333 1343#TEL\n","(570) 874-1112#TEL\n","(901)111-2222#TEL\n","(109)333 1343#TEL\n","(570) 874-1112#TEL\n","28450#ZIPCODE\n","49144#ZIPCODE\n","14412#ZIPCODE\n","10/10/1983#DOB\n","04/06/1990#DOB\n","03/11/2001#DOB\n","'''\n","\n","with open('obfuscator_unique_ref_test.txt', 'w') as f:\n"," 
f.write(obfuscator_unique_ref_test)"],"metadata":{"id":"nq4Ow6jvGEHB","executionInfo":{"status":"ok","timestamp":1685536229244,"user_tz":-180,"elapsed":847,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"id":"nq4Ow6jvGEHB","execution_count":78,"outputs":[]},{"cell_type":"code","execution_count":79,"id":"0c365708-8870-47c8-87e1-50bd72e3741e","metadata":{"id":"0c365708-8870-47c8-87e1-50bd72e3741e","colab":{"base_uri":"https://localhost:8080/"},"outputId":"7b55d706-b789-4b96-8833-bd5fd46fd61d","executionInfo":{"status":"ok","timestamp":1685536232559,"user_tz":-180,"elapsed":780,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["+------------------+----+\n","|NAME |AGE |\n","+------------------+----+\n","|[Inci FOUNTAIN] |[60]|\n","|[Jack SANTOS] |[30]|\n","|[Darene GEORGIOUS]|[30]|\n","|[Shelly WELLBECK] |[40]|\n","|[Hubert GROGAN] |[40]|\n","|[Kalil AMIN] |[40]|\n","|[ALTHEA COLBURN] |[60]|\n","|[Thomson THOMAS] |[60]|\n","|[Jack SANTOS] |[60]|\n","|[Will Perry] |[20]|\n","|[Jackson WILLE] |[60]|\n","|[Shelly WELLBECK] |[40]|\n","|[Kalil AMIN] |[30]|\n","|[Marnie MELINGTON]|[30]|\n","|[Min Kun JAE] |[30]|\n","|[Marvin MARSHALL] |[60]|\n","|[Marvin MARSHALL] |[50]|\n","|[Min Kun JAE] |[30]|\n","|[Maryland CODER] |[20]|\n","|[Marnie MELINGTON]|[20]|\n","+------------------+----+\n","only showing top 20 rows\n","\n"]}],"source":["obfuscator = legal.StructuredDeidentification(spark,{\"NAME\":\"NAME\",\"AGE\":\"AGE\"}, \n"," obfuscateRefFile = \"/content/obfuscator_unique_ref_test.txt\",\n"," obfuscateRefSource = \"file\",\n"," columnsSeed={\"NAME\": 23, \"AGE\": 23})\n","obfuscator_df = obfuscator.obfuscateColumns(df)\n","\n","obfuscator_df.select(\"NAME\",\"AGE\").show(truncate=False)"]},{"cell_type":"markdown","id":"48155ef4-fe14-4372-8d2b-9cda5086acc2","metadata":{"id":"48155ef4-fe14-4372-8d2b-9cda5086acc2"},"source":["## shift 
days"]},{"cell_type":"code","execution_count":80,"id":"01cef8d9-982f-41f3-91a6-2808942b267d","metadata":{"id":"01cef8d9-982f-41f3-91a6-2808942b267d","colab":{"base_uri":"https://localhost:8080/","height":143},"outputId":"395d3ee9-4e79-4cdd-d4f4-05b0d9d51d4d","executionInfo":{"status":"ok","timestamp":1685536245300,"user_tz":-180,"elapsed":1085,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"execute_result","data":{"text/plain":[" NAME DOB ADDRESS SBP TEL\n","0 Juan García 13/02/1977 711 Nulla St. 140 673 431234\n","1 Will Smith 23/02/1977 1 Green Avenue. 140 +23 (673) 431234\n","2 Pedro Ximénez 11/04/1900 Calle del Libertador, 7 100 912 345623"],"text/html":["\n","
\n"," "]},"metadata":{},"execution_count":80}],"source":["# Dates can be shifted by n days through the \"days\" parameter; this applies to columns mapped to the DATE entity.\n","\n","df = spark.createDataFrame([\n"," [\"Juan García\", \"13/02/1977\", \"711 Nulla St.\", \"140\", \"673 431234\"],\n"," [\"Will Smith\", \"23/02/1977\", \"1 Green Avenue.\", \"140\", \"+23 (673) 431234\"],\n"," [\"Pedro Ximénez\", \"11/04/1900\", \"Calle del Libertador, 7\", \"100\", \"912 345623\"]\n"," ]).toDF(\"NAME\", \"DOB\", \"ADDRESS\", \"SBP\", \"TEL\")\n","\n","df_pd = df.toPandas()\n","df_pd.to_csv(\"deid_dayshift_structured_data.csv\", index=False)\n","df_pd.head()"]},{"cell_type":"code","source":["# NAME is replaced with a generated ID; DOB is shifted forward by 5 days.\n","obfuscator = legal.StructuredDeidentification(spark, \n"," columns = {\"NAME\": \"ID\", \"DOB\": \"DATE\"},\n"," obfuscateRefSource = \"faker\",\n"," columnsSeed={\"NAME\": 23},\n"," days = 5)\n","obfuscator_df = obfuscator.obfuscateColumns(df)"],"metadata":{"id":"NKRIdMHpyxUP","executionInfo":{"status":"ok","timestamp":1685536440041,"user_tz":-180,"elapsed":2,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"id":"NKRIdMHpyxUP","execution_count":82,"outputs":[]},{"cell_type":"code","execution_count":83,"id":"db013741-9710-415f-b1ef-b6870ed814cd","metadata":{"id":"db013741-9710-415f-b1ef-b6870ed814cd","colab":{"base_uri":"https://localhost:8080/"},"outputId":"b5afdd0f-d82e-4d1d-8b0b-6db2a0ae3a47","executionInfo":{"status":"ok","timestamp":1685536456024,"user_tz":-180,"elapsed":1162,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["+----------+------------+-----------------------+---+----------------+\n","|NAME |DOB |ADDRESS |SBP|TEL |\n","+----------+------------+-----------------------+---+----------------+\n","|[G9296129]|[18/02/1977]|711 Nulla St. |140|673 431234 |\n","|[M9239301]|[28/02/1977]|1 Green Avenue. 
|140|+23 (673) 431234|\n","|[H3156881]|[16/04/1900]|Calle del Libertador, 7|100|912 345623 |\n","+----------+------------+-----------------------+---+----------------+\n","\n"]}],"source":["obfuscator_df.show(truncate=False)"]},{"cell_type":"code","source":[],"metadata":{"id":"C1-OUXb7zTUg"},"id":"C1-OUXb7zTUg","execution_count":null,"outputs":[]}],"metadata":{"kernelspec":{"display_name":"Python 3","name":"python3"},"language_info":{"name":"python"},"toc-autonumbering":true,"colab":{"provenance":[{"file_id":"https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/11.1.Deidentification_Utility_Module.ipynb","timestamp":1683194701833}],"toc_visible":true},"gpuClass":"standard"},"nbformat":4,"nbformat_minor":5}