{ "cells": [ { "cell_type": "markdown", "id": "1a17817b-ca1b-425a-ba6b-ae879faa0e9d", "metadata": { "id": "1a17817b-ca1b-425a-ba6b-ae879faa0e9d" }, "source": [ "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)" ] }, { "cell_type": "markdown", "id": "21e9eafb", "metadata": { "id": "21e9eafb" }, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/11.1.Pretrained_Deidentification_Pipeline.ipynb)" ] }, { "cell_type": "markdown", "id": "fb2ac5e3-fe7c-4431-bb24-3c48649be54b", "metadata": { "id": "fb2ac5e3-fe7c-4431-bb24-3c48649be54b" }, "source": [ "# Financial Deidentification" ] }, { "cell_type": "markdown", "id": "4iIO6G_B3pqq", "metadata": { "collapsed": false, "id": "4iIO6G_B3pqq" }, "source": [ "# Installation" ] }, { "cell_type": "code", "execution_count": null, "id": "hPwo4Czy3pqq", "metadata": { "id": "hPwo4Czy3pqq", "pycharm": { "is_executing": true } }, "outputs": [], "source": [ "! pip install -q johnsnowlabs" ] }, { "cell_type": "markdown", "id": "YPsbAnNoPt0Z", "metadata": { "id": "YPsbAnNoPt0Z" }, "source": [ "## Automatic Installation\n", "Using my.johnsnowlabs.com SSO" ] }, { "cell_type": "code", "execution_count": null, "id": "_L-7mLYp3pqr", "metadata": { "id": "_L-7mLYp3pqr", "pycharm": { "is_executing": true } }, "outputs": [], "source": [ "from johnsnowlabs import nlp, finance\n", "\n", "# nlp.install(force_browser=True)" ] }, { "cell_type": "markdown", "id": "hsJvn_WWM2GL", "metadata": { "id": "hsJvn_WWM2GL" }, "source": [ "## Manual downloading\n", "If you are not registered in my.johnsnowlabs.com, you received a license via e-email or you are using Safari, you may need to do a manual update of the license.\n", "\n", "- Go to my.johnsnowlabs.com\n", "- Download your license\n", "- Upload it using the following command" ] }, { "cell_type": "code", "execution_count": null, "id": "i57QV3-_P2sQ", "metadata": { "id": "i57QV3-_P2sQ" }, "outputs": [], "source": [ "from google.colab import files\n", "print('Please Upload your John Snow Labs License using the button below')\n", "license_keys = files.upload()" ] }, { "cell_type": "markdown", "id": "xGgNdFzZP_hQ", "metadata": { "id": "xGgNdFzZP_hQ" }, "source": [ "- Install it" ] }, { "cell_type": "code", "execution_count": null, "id": "OfmmPqknP4rR", "metadata": { "id": "OfmmPqknP4rR" }, "outputs": [], "source": [ "nlp.install()" ] }, { "cell_type": "markdown", "id": "DCl5ErZkNNLk", "metadata": { "id": "DCl5ErZkNNLk" }, "source": [ "# Starting" ] }, { "cell_type": "code", "execution_count": 3, "id": "x3jVICoa3pqr", "metadata": { "id": "x3jVICoa3pqr", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "7059dfdd-1b1a-4178-b5b6-077f9b51f9c2" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "👌 Launched \u001b[92mcpu optimized\u001b[39m session with with: 🚀Spark-NLP==4.3.0, 💊Spark-Healthcare==4.3.0, running on ⚡ PySpark==3.1.2\n" ] } ], "source": [ "spark = nlp.start()" ] }, { "cell_type": "markdown", "id": "615a3cf3-ae62-40cf-b9b2-a68c8053c908", "metadata": { "id": "615a3cf3-ae62-40cf-b9b2-a68c8053c908" }, "source": [ "# Pretrained Deidentification Pipeline\n", "\n", "We have this pipeline can be used to deidentify financial information from texts.The financial information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `DOC`, `EFFDATE`, `PARTY`, `ALIAS`, `SIGNING_PERSON`, `SIGNING_TITLE`, `COUNTRY`, `CITY`, `STATE`, `STREET`, `ZIP`, `EMAIL`, `FAX`, `LOCATION-OTHER`, `DATE`,`PHONE` entities." ] }, { "cell_type": "code", "execution_count": 4, "id": "994c32d7-44e7-4637-9f64-454af9168dab", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "994c32d7-44e7-4637-9f64-454af9168dab", "outputId": "0a92782a-41c6-4a92-a676-bdca7dee38c4" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "finpipe_deid download started this may take some time.\n", "Approx size to download 437.3 MB\n", "[OK!]\n" ] } ], "source": [ "deid_pipeline = nlp.PretrainedPipeline(\"finpipe_deid\", \"en\", \"finance/models\")\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "a09f14d8-54d9-4ed9-a7c4-6074f421020a", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "a09f14d8-54d9-4ed9-a7c4-6074f421020a", "outputId": "2ffe95d7-22fd-47ca-e715-de4177ffe548" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[DocumentAssembler_20aaea0b09c9,\n", " SentenceDetector_f836f3c49dd7,\n", " REGEX_TOKENIZER_3d88a1dee1d9,\n", " BERT_EMBEDDINGS_29ce72cd673e,\n", " FinanceNerModel_1e04a0ea86dc,\n", " NER_CONVERTER_053dc2c885dc,\n", " FinanceNerModel_99ecfbac41c1,\n", " NER_CONVERTER_c31e7133c116,\n", " FinanceNerModel_fae1a65403a6,\n", " NER_CONVERTER_e54c4e5afd15,\n", " CONTEXTUAL-PARSER_72fff5ea72a3,\n", " CONTEXTUAL-PARSER_247b3d47153a,\n", " CONTEXTUAL-PARSER_8804c3848e07,\n", " CONTEXTUAL-PARSER_138e93ac7638,\n", " CONTEXTUAL-PARSER_222a1bc3dc39,\n", " MERGE_72dccb34a947,\n", " DE-IDENTIFICATION_95319986720c,\n", " DE-IDENTIFICATION_e98c1ba6424c,\n", " DE-IDENTIFICATION_b423b4e6a14e,\n", " DE-IDENTIFICATION_d6ea024c8838]" ] }, "metadata": {}, "execution_count": 5 } ], "source": [ "deid_pipeline.model.stages" ] }, { "cell_type": "code", "execution_count": 6, "id": "00a6b038-7178-4fe3-b7f6-664bad53ff92", "metadata": { "id": "00a6b038-7178-4fe3-b7f6-664bad53ff92" }, "outputs": [], "source": [ "text= \"\"\" REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF\n", "Commvault Systems, Inc. \n", "(Exact name of registrant as specified in its charter) \n", "Signed By : Sherly Johnson\n", "(Address of principal executive offices, including zip code) \n", "(732) 870-4000\n", "(telephone number, including area code) \n", "Name of each exchange on which registered\n", "CVLT\n", "The NASDAQ Stock Market\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 7, "id": "bd76c039-f1bb-4839-a2c5-9b5715629700", "metadata": { "id": "bd76c039-f1bb-4839-a2c5-9b5715629700" }, "outputs": [], "source": [ "deid_res= deid_pipeline.annotate(text)" ] }, { "cell_type": "code", "execution_count": 8, "id": "84add88b-9686-4fcc-83fa-6f64465f4aff", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "84add88b-9686-4fcc-83fa-6f64465f4aff", "outputId": "8521f31e-4dcb-4f82-df37-4b5400ef1b5b" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "dict_keys(['obfuscated', 'ner_10k_chunk', 'email', 'document', 'ner_signers_chunk', 'deidentified', 'alias', 'chiefs', 'masked_fixed_length_chars', 'token', 'ner_signers', 'ner_generic_chunk', 'embeddings', 'merged_ner_chunks', 'ner_10k', 'sentence', 'phone', 'orgs', 'masked_with_chars', 'ner_generic'])" ] }, "metadata": {}, "execution_count": 8 } ], "source": [ "deid_res.keys()" ] }, { "cell_type": "code", "execution_count": 9, "id": "9551bf51-c2e4-4d7c-968e-1c75bd43f888", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 326 }, "id": "9551bf51-c2e4-4d7c-968e-1c75bd43f888", "outputId": "2365814e-22e8-44b1-94d3-560235f9deaa" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Sentence \\\n", "0 REPORT PURSUANT TO SECTION 13 OR 15 \n", "1 (d) OF THE SECURITIES EXCHANGE ACT OF\\nCommvault Systems, Inc. \n", "2 (Exact name of registrant as specified in its charter) \\nSigned By : Sherly Johnson\\n(Address of... \n", "\n", " Masked \\\n", "0 REPORT PURSUANT TO SECTION 13 OR 15 \n", "1 (d) OF . \n", "2 (Exact name of registrant as specified in its charter) \\nSigned By : \\n(Address of princ... \n", "\n", " Masked with Chars \\\n", "0 REPORT PURSUANT TO SECTION 13 OR 15 \n", "1 (d) OF [***************************************************]. \n", "2 (Exact name of registrant as specified in its charter) \\nSigned By : [************]\\n(Address of... \n", "\n", " Masked with Fixed Chars \\\n", "0 REPORT PURSUANT TO SECTION 13 OR 15 \n", "1 (d) OF ****. \n", "2 (Exact name of registrant as specified in its charter) \\nSigned By : ****\\n(Address of principal... \n", "\n", " Obfuscated \n", "0 REPORT PURSUANT TO SECTION 13 OR 15 \n", "1 (d) OF Gillespie Inc. \n", "2 (Exact name of registrant as specified in its charter) \\nSigned By : Ashley Patrick\\n(Address of... " ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SentenceMaskedMasked with CharsMasked with Fixed CharsObfuscated
0REPORT PURSUANT TO SECTION 13 OR 15REPORT PURSUANT TO SECTION 13 OR 15REPORT PURSUANT TO SECTION 13 OR 15REPORT PURSUANT TO SECTION 13 OR 15REPORT PURSUANT TO SECTION 13 OR 15
1(d) OF THE SECURITIES EXCHANGE ACT OF\\nCommvault Systems, Inc.(d) OF <ORG>.(d) OF [***************************************************].(d) OF ****.(d) OF Gillespie Inc.
2(Exact name of registrant as specified in its charter) \\nSigned By : Sherly Johnson\\n(Address of...(Exact name of registrant as specified in its charter) \\nSigned By : <PERSON>\\n(Address of princ...(Exact name of registrant as specified in its charter) \\nSigned By : [************]\\n(Address of...(Exact name of registrant as specified in its charter) \\nSigned By : ****\\n(Address of principal...(Exact name of registrant as specified in its charter) \\nSigned By : Ashley Patrick\\n(Address of...
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 9 } ], "source": [ "import pandas as pd\n", "\n", "pd.set_option(\"display.max_colwidth\", 100)\n", "\n", "df= pd.DataFrame(list(zip(deid_res[\"sentence\"], \n", " deid_res[\"deidentified\"],\n", " deid_res[\"masked_with_chars\"],\n", " deid_res[\"masked_fixed_length_chars\"], \n", " deid_res[\"obfuscated\"])),\n", " columns= [\"Sentence\", \"Masked\", \"Masked with Chars\", \"Masked with Fixed Chars\", \"Obfuscated\"])\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "H4FsbEuoCKg6", "metadata": { "id": "H4FsbEuoCKg6" }, "outputs": [], "source": [] } ], "metadata": { "colab": { "provenance": [], "toc_visible": true }, "gpuClass": "standard", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6 (default, Oct 18 2022, 12:41:40) \n[Clang 14.0.0 (clang-1400.0.29.202)]" }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 5 }