{
"cells": [
{
"cell_type": "markdown",
"id": "wxZDXLDCXkk_",
"metadata": {
"id": "wxZDXLDCXkk_"
},
"source": [
"\n",
"\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "pZ6sKi8ZX1z4",
"metadata": {
"id": "pZ6sKi8ZX1z4"
},
"source": [
"[](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/05.1.Training_Financial_NER.ipynb)"
]
},
{
"cell_type": "markdown",
"id": "KLqW6FOnEvov",
"metadata": {
"id": "KLqW6FOnEvov"
},
"source": [
"#π Training Financial NER\n"
]
},
{
"cell_type": "markdown",
"id": "Yjl-5MGlx0dF",
"metadata": {
"collapsed": false,
"id": "Yjl-5MGlx0dF"
},
"source": [
"#π¬ Installation"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "MjgJyCCIx0dP",
"metadata": {
"id": "MjgJyCCIx0dP",
"pycharm": {
"is_executing": true
}
},
"outputs": [],
"source": [
"! pip install -q johnsnowlabs"
]
},
{
"cell_type": "markdown",
"id": "7bJI_ekTx0dQ",
"metadata": {
"id": "7bJI_ekTx0dQ"
},
"source": [
"##π Automatic Installation\n",
"Using [my.johnsnowlabs.com](https://my.johnsnowlabs.com/) SSO"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "DEl2pY8Lx0dQ",
"metadata": {
"id": "DEl2pY8Lx0dQ",
"pycharm": {
"is_executing": true
}
},
"outputs": [],
"source": [
"from johnsnowlabs import nlp, finance\n",
"\n",
"# nlp.install(force_browser=True)"
]
},
{
"cell_type": "markdown",
"id": "zKIDRSiOx0dQ",
"metadata": {
"id": "zKIDRSiOx0dQ"
},
"source": [
"##π Manual downloading\n",
"If you are not registered in my.johnsnowlabs.com, you received a license via e-email or you are using Safari, you may need to do a manual update of the license.\n",
"\n",
"- Go to [my.johnsnowlabs.com](https://my.johnsnowlabs.com/)\n",
"- Download your license\n",
"- Upload it using the following command"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7iXsGZrIx0dQ",
"metadata": {
"id": "7iXsGZrIx0dQ"
},
"outputs": [],
"source": [
"from google.colab import files\n",
"print('Please Upload your John Snow Labs License using the button below')\n",
"license_keys = files.upload()"
]
},
{
"cell_type": "markdown",
"id": "PUlLsDgkx0dQ",
"metadata": {
"id": "PUlLsDgkx0dQ"
},
"source": [
"- Install it"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2NA8ka6Fx0dQ",
"metadata": {
"id": "2NA8ka6Fx0dQ"
},
"outputs": [],
"source": [
"nlp.install()"
]
},
{
"cell_type": "markdown",
"id": "S4mvOi6jwlcr",
"metadata": {
"id": "S4mvOi6jwlcr"
},
"source": [
"##π Start Spark Session"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "TZIjuI3zN1Oi",
"metadata": {
"id": "TZIjuI3zN1Oi"
},
"outputs": [],
"source": [
"from johnsnowlabs import nlp, finance\n",
"# Automatically load license data and start a session with all jars user has access to\n",
"spark = nlp.start()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "YeIQqpP6KkW9",
"metadata": {
"id": "YeIQqpP6KkW9"
},
"outputs": [],
"source": [
"from pyspark.sql import DataFrame\n",
"import pyspark.sql.functions as F\n",
"import pyspark.sql.types as T\n",
"import pyspark.sql as SQL\n",
"from pyspark import keyword_only"
]
},
{
"cell_type": "markdown",
"id": "N4QLNrIdB0Ex",
"metadata": {
"id": "N4QLNrIdB0Ex"
},
"source": [
"##π Training a custom NerModel"
]
},
{
"cell_type": "markdown",
"id": "KeDpXEXDBvYk",
"metadata": {
"id": "KeDpXEXDBvYk"
},
"source": [
"\n",
"πThe model was trained in the available [Tweets dataset](https://www.kaggle.com/omermetinn/tweets-about-the-top-companies-from-2015-to-2020), with data from 2015 to 2020. \n",
"\n",
"If your appliation needs different entities than the provided pretrained models can identify, what you can do is to train a new model that fits your requirements. To do that you first need to collect and label enough data and put them in the CoNLL 2003 format. If you are not sure how to annotate (label) text data and prepare it in the CoNLL 2003 format, try our free tool [Annotation Lab](https://nlp.johnsnowlabs.com/docs/en/alab/quickstart), where you can easily label text data and export in the correct format for training.\n",
"\n",
"For our purposes here, we will use a sample file annotated by our team."
]
},
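{
"cell_type": "markdown",
"id": "conll-format-sketch",
"metadata": {
"id": "conll-format-sketch"
},
"source": [
"A CoNLL 2003 file stores one token per line, followed by its part-of-speech tag, chunk tag and entity label, with blank lines separating sentences. The sketch below is a minimal plain-Python illustration of that layout, assuming whitespace-separated columns with the entity label last. `parse_conll` and the sample lines are hypothetical and used only here; the notebook itself relies on the Spark NLP `CoNLL` helper.\n",
"\n",
"```python\n",
"def parse_conll(lines):\n",
"    # Hypothetical helper: group CoNLL rows into sentences of (token, label) pairs.\n",
"    sentences, current = [], []\n",
"    for line in lines:\n",
"        parts = line.split()\n",
"        if not parts or parts[0] == '-DOCSTART-':\n",
"            # A blank line or a document marker ends the current sentence.\n",
"            if current:\n",
"                sentences.append(current)\n",
"                current = []\n",
"            continue\n",
"        # First column is the token, last column is the entity label.\n",
"        current.append((parts[0], parts[-1]))\n",
"    if current:\n",
"        sentences.append(current)\n",
"    return sentences\n",
"\n",
"\n",
"# Illustrative rows in the same four-column layout as the sample file.\n",
"sample = [\n",
"    'ended NN NN O',\n",
"    'March NNP NNP B-FISCAL_YEAR',\n",
"    '31 NNP NNP I-FISCAL_YEAR',\n",
"    ', NNP NNP I-FISCAL_YEAR',\n",
"    '2021 NNP NNP I-FISCAL_YEAR',\n",
"    '',\n",
"    'In NN NN O',\n",
"    'November NNP NNP B-DATE',\n",
"    '2004 NNP NNP I-DATE',\n",
"]\n",
"sentences = parse_conll(sample)\n",
"print(len(sentences))\n",
"print(sentences[1])\n",
"```\n",
"\n",
"The `B-` prefix marks the first token of an entity and `I-` marks its continuation, so a multi-token entity such as a fiscal year date is one `B-` tag followed by `I-` tags."
]
},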
{
"cell_type": "markdown",
"id": "JYBQyxEd0uR0",
"metadata": {
"id": "JYBQyxEd0uR0"
},
"source": [
"###βοΈ CoNLL Data Prep \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "AVBmGFcQ03La",
"metadata": {
"id": "AVBmGFcQ03La"
},
"outputs": [],
"source": [
"! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/conll_noO.conll"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "-JxIUBKV1GJS",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "-JxIUBKV1GJS",
"outputId": "b825ea2c-87e9-4f7d-92c0-6c9f7c0aa613"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"( NN NN O\n",
"d NN NN O\n",
") NN NN O\n",
"OF NN NN O\n",
"THE NN NN O\n",
"SECURITIES NN NN O\n",
"EXCHANGE NN NN O\n",
"ACT NN NN O\n",
"OF NN NN O\n",
"1934 NN NN O\n",
"For NN NN O\n",
"the NN NN O\n",
"annual NN NN O\n",
"period NN NN O\n",
"ended NN NN O\n",
"March NNP NNP B-FISCAL_YEAR\n",
"31 NNP NNP I-FISCAL_YEAR\n",
", NNP NNP I-FISCAL_YEAR\n",
"2021 NNP NNP I-FISCAL_YEAR\n",
"March NNP NNP B-FISCAL_YEAR\n",
"31 NNP NNP I-FISCAL_YEAR\n",
", NNP NNP I-FISCAL_YEAR\n",
"2021 NNP NNP I-FISCAL_YEAR\n",
"β NN NN O\n",
"TRANSITION NN NN O\n",
"REPORT NN NN O\n",
"UNDER NN NN O\n",
"SECTION NN NN O\n",
"13 NN NN O\n",
"OR NN NN O\n",
"15 \n"
]
}
],
"source": [
"with open(\"./conll_noO.conll\") as f:\n",
" train_txt =f.read()\n",
"\n",
"print(train_txt[:500])"
]
},
{
"cell_type": "markdown",
"id": "jb7xQ6EdCElD",
"metadata": {
"id": "jb7xQ6EdCElD"
},
"source": [
"The pipeline is similar to the `NerModel` one, but instead of a `AnnotatorModel`, we use an `AnnotatorApproach` object to train the model. If these concepts of annotator and model is not familiar to you, please review the documentation [here](https://nlp.johnsnowlabs.com/docs/en/concepts).\n",
"\n",
"To load the data into spark dataframe, you can use the [CoNLL](https://nlp.johnsnowlabs.com/docs/en/training#conll-dataset) helper."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "DSEC5CTIIPjK",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "DSEC5CTIIPjK",
"outputId": "f0595f93-5fbc-4895-c30a-cb827a7dcf58"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
n",
"|text |tokens |pos |label |\n",
n",
"|( d ) OF THE SECURITIES EXCHANGE ACT OF 1934 For the annual period ended March 31 , 2021 March 31 , 2021 β TRANSITION REPORT UNDER SECTION 13 OR 15|[(, d, ), OF, THE, SECURITIES, EXCHANGE, ACT, OF, 1934, For, the, annual, period, ended, March, 31, ,, 2021, March, 31, ,, 2021, β, TRANSITION, REPORT, UNDER, SECTION, 13, OR, 15]|[NN, NN, NN, NN, NN, NN, NN, NN, NN, NN, NN, NN, NN, NN, NN, NNP, NNP, NNP, NNP, NNP, NNP, NNP, NNP, NN, NN, NN, NN, NN, NN, NN, NN]|[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, B-FISCAL_YEAR, I-FISCAL_YEAR, I-FISCAL_YEAR, I-FISCAL_YEAR, B-FISCAL_YEAR, I-FISCAL_YEAR, I-FISCAL_YEAR, I-FISCAL_YEAR, O, O, O, O, O, O, O, O]|\n",
"|ο»Ώ COMPANY BACKGROUND ο»Ώ Evolving Systems was founded in 1985 to provide software and services to the U.S . telecommunications industry . |[ο»Ώ, COMPANY, BACKGROUND, ο»Ώ, Evolving, Systems, was, founded, in, 1985, to, provide, software, and, services, to, the, U.S, ., telecommunications, industry, .] |[NN, NN, NN, NN, NN, NN, NN, NN, NN, NNP, NN, NN, NN, NN, NN, NN, NN, NN, NN, NN, NN, NN] |[O, O, O, O, O, O, O, O, O, B-DATE, O, O, O, O, O, O, O, O, O, O, O, O] |\n",
"|In November 2004 , we expanded our product set and geographical reach with the acquisition of Tertio Telecoms Ltd . |[In, November, 2004, ,, we, expanded, our, product, set, and, geographical, reach, with, the, acquisition, of, Tertio, Telecoms, Ltd, .] |[NN, NNP, NNP, NN, NN, NN, NN, NN, NN, NN, NN, NN, NN, NN, NN, NN, NN, NN, NN, NN] |[O, B-DATE, I-DATE, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O] |\n",
n",
"only showing top 3 rows\n",
"\n"
]
}
],
"source": [
"from sparknlp.training import CoNLL\n",
"\n",
"finance_data = CoNLL().readDataset(spark, \"conll_noO.conll\")\n",
"finance_data.selectExpr(\n",
" \"text\", \"token.result as tokens\", \"pos.result as pos\", \"label.result as label\"\n",
").show(3, False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "R6xa4jp8Szs0",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "R6xa4jp8Szs0",
"outputId": "c1e620d6-25b0-481e-df97-8c67879d4b1a"
},
"outputs": [
{
"data": {
"text/plain": [
"1637"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"finance_data.count()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "UE5jiEP-KJsh",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "UE5jiEP-KJsh",
"outputId": "8f8389d8-31c2-426f-93f4-e84829a5f7bd"
},
"outputs": [
{
"data": {
"text/plain": [
"['text', 'document', 'sentence', 'token', 'pos', 'label']"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"finance_data.columns"
]
},
{
"cell_type": "markdown",
"id": "LqVe225XJfLG",
"metadata": {
"id": "LqVe225XJfLG"
},
"source": [
"Checking the labels we have:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "yKmO5faJJXRJ",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "yKmO5faJJXRJ",
"outputId": "8adabe85-2b2e-48f4-ef8d-0b8de95fb624"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------------+\n",
"|col |\n",
"+------------------+\n",
"|I-PROFIT_INCREASE |\n",
"|B-PROFIT |\n",
"|B-AMOUNT |\n",
"|I-PROFIT |\n",
"|B-PERCENTAGE |\n",
"|B-PROFIT_DECLINE |\n",
"|B-PROFIT_INCREASE |\n",
"|I-DATE |\n",
"|I-AMOUNT |\n",
"|B-EXPENSE |\n",
"|B-EXPENSE_INCREASE|\n",
"|I-EXPENSE_INCREASE|\n",
"|I-PROFIT_DECLINE |\n",
"|O |\n",
"|B-CURRENCY |\n",
"|I-PERCENTAGE |\n",
"|B-FISCAL_YEAR |\n",
"|I-FISCAL_YEAR |\n",
"|B-DATE |\n",
"|I-EXPENSE_DECREASE|\n",
"|B-EXPENSE_DECREASE|\n",
"|I-EXPENSE |\n",
"+------------------+\n",
"\n"
]
}
],
"source": [
"finance_data.select(F.explode(\"label.result\")).distinct().show(50, False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "-83y2Ak0Y3m1",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "-83y2Ak0Y3m1",
"outputId": "8eb7b180-8355-4472-e044-6adc4ad75a93"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------------+-----+\n",
"|ground_truth |count|\n",
"+------------------+-----+\n",
"|O |51912|\n",
"|I-DATE |1932 |\n",
"|I-FISCAL_YEAR |1812 |\n",
"|B-DATE |1797 |\n",
"|B-AMOUNT |1466 |\n",
"|B-CURRENCY |1461 |\n",
"|I-AMOUNT |1134 |\n",
"|B-FISCAL_YEAR |605 |\n",
"|I-EXPENSE_INCREASE|546 |\n",
"|I-EXPENSE_DECREASE|390 |\n",
"|B-PERCENTAGE |350 |\n",
"|I-PROFIT_INCREASE |288 |\n",
"|I-EXPENSE |280 |\n",
"|B-EXPENSE_INCREASE|274 |\n",
"|I-PROFIT |228 |\n",
"|B-EXPENSE_DECREASE|191 |\n",
"|B-PROFIT_INCREASE |164 |\n",
"|B-EXPENSE |150 |\n",
"|B-PROFIT |122 |\n",
"|I-PROFIT_DECLINE |93 |\n",
"|B-PROFIT_DECLINE |58 |\n",
"|I-PERCENTAGE |12 |\n",
"+------------------+-----+\n",
"\n"
]
}
],
"source": [
"finance_data.select(\n",
" F.explode(F.arrays_zip(finance_data.token.result, finance_data.label.result)).alias(\n",
" \"cols\"\n",
" )\n",
").select(\n",
" F.expr(\"cols['0']\").alias(\"token\"), F.expr(\"cols['1']\").alias(\"ground_truth\")\n",
").groupBy(\n",
" \"ground_truth\"\n",
").count().orderBy(\n",
" \"count\", ascending=False\n",
").show(\n",
" 100, truncate=False\n",
")"
]
},
{
"cell_type": "markdown",
"id": "kDIFq1bhDC4d",
"metadata": {
"id": "kDIFq1bhDC4d"
},
"source": [
"πThe CoNLL data already have the columns `document`, `sentence` and `token` that are needed to create the NER model, the only one that is missing is the Embeddings. So let's use the same embedding pretrained model as before to train this new one, but you could use any Embedding model instead (check [SparkNLP Models Hub](https://nlp.johnsnowlabs.com/models?task=Embeddings) for a list of available embedding models)."
]
},
{
"cell_type": "markdown",
"id": "2WZDqlZA_kmb",
"metadata": {
"id": "2WZDqlZA_kmb"
},
"source": [
"###βοΈ Using Bert Embeddings"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7qfJh8ap_nI2",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "7qfJh8ap_nI2",
"outputId": "baa423cd-826a-4626-c26d-499fbc04e8f5"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"bert_embeddings_sec_bert_base download started this may take some time.\n",
"Approximate size to download 390.4 MB\n",
"[OK!]\n"
]
}
],
"source": [
"bert_embeddings = nlp.BertEmbeddings.pretrained(\"bert_embeddings_sec_bert_base\", \"en\") \\\n",
" .setInputCols(\"sentence\", \"token\") \\\n",
" .setOutputCol(\"embeddings\")\\\n",
" .setMaxSentenceLength(512)"
]
},
{
"cell_type": "markdown",
"id": "YUNd3OpOLuJB",
"metadata": {
"id": "YUNd3OpOLuJB"
},
"source": [
"Split the data into train and test sets"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8FOT9laXLt_c",
"metadata": {
"id": "8FOT9laXLt_c"
},
"outputs": [],
"source": [
"train_data, test_data = finance_data.randomSplit([0.8, 0.2], seed=42)"
]
},
{
"cell_type": "markdown",
"id": "ZnRJAgYUNIRm",
"metadata": {
"id": "ZnRJAgYUNIRm"
},
"source": [
"We transform the test data and store it into a parquet file so we can use it during training for testing."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "O0aOYktdNH-e",
"metadata": {
"id": "O0aOYktdNH-e"
},
"outputs": [],
"source": [
"bert_embeddings.transform(test_data).write.mode(\"overwrite\").parquet(\n",
" \"test_data_embeddings.parquet\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "b1k6kd7SMpBs",
"metadata": {
"id": "b1k6kd7SMpBs"
},
"source": [
"πDeclare the train annotator using the `NerApproach`. In this example, we will train for only 2 epochs to illustrate how to use the annotator without spending too much time waiting the model to finish training, but we recommend to use 5-50 epochs depending on your application to obtain a proper model."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "Fe0957BT_rcy",
"metadata": {
"id": "Fe0957BT_rcy"
},
"outputs": [],
"source": [
"nerTagger = finance.NerApproach()\\\n",
" .setInputCols([\"sentence\", \"token\", \"embeddings\"])\\\n",
" .setLabelColumn(\"label\")\\\n",
" .setOutputCol(\"ner\")\\\n",
" .setMaxEpochs(2)\\\n",
" .setLr(0.003)\\\n",
" .setBatchSize(32)\\\n",
" .setRandomSeed(0)\\\n",
" .setVerbose(1)\\\n",
" .setValidationSplit(0.2)\\\n",
" .setEvaluationLogExtended(True) \\\n",
" .setEnableOutputLogs(True)\\\n",
" .setIncludeConfidence(True)\\\n",
" .setEnableMemoryOptimizer(True)\\\n",
" .setOutputLogsPath('ner_logs') # if not set, logs will be written to ~/annotator_logs\n",
"# .setGraphFolder('graphs') >> put your graph file (pb) under this folder if you are using a custom graph generated the 4.1 NerDL-Graph.ipynb notebook or you can use TFGraphBuilder annotator \n",
"# .setEnableMemoryOptimizer(True)\\ # if you have a limited memory and a large conll file, you can set this True to train batch by batch\n",
"\n",
"ner_pipeline = nlp.Pipeline(\n",
" stages=[\n",
" bert_embeddings,\n",
" nerTagger\n",
" ])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "G59yuxavLt7Q",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "G59yuxavLt7Q",
"outputId": "1e49f04a-4f00-4a50-af6f-b6d5155c3493"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 11.1 s, sys: 1.26 s, total: 12.4 s\n",
"Wall time: 33min 46s\n"
]
}
],
"source": [
"%%time\n",
"\n",
"ner_model = ner_pipeline.fit(train_data)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "-8itI3ckBOR7",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "-8itI3ckBOR7",
"outputId": "e341545e-9c37-4df3-cfae-f9943c17f1de"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Name of the selected graph: medical-ner-dl/blstm_38_768_128_200.pb\n",
"Training started - total epochs: 2 - lr: 0.003 - batch size: 32 - labels: 22 - chars: 96 - training examples: 1076\n",
"\n",
"\n",
"Epoch 1/2 started, lr: 0.003, dataset size: 1076\n",
"\n",
"\n",
"Epoch 1/2 - 459.17s - loss: 1156.4308 - avg training loss: 34.01267 - batches: 34\n",
"Quality on validation dataset (20.0%), validation examples = 215\n",
"time to finish evaluation: 365.12s\n",
"Total validation loss: 77.2285\tAvg validation loss: 8.5809\n",
"label\t tp\t fp\t fn\t prec\t rec\t f1\n",
"I-AMOUNT\t 147\t 1\t 8\t 0.9932432\t 0.9483871\t 0.970297\n",
"B-AMOUNT\t 209\t 7\t 2\t 0.9675926\t 0.9905213\t 0.9789227\n",
"B-DATE\t 194\t 20\t 100\t 0.90654206\t 0.65986395\t 0.7637795\n",
"I-DATE\t 278\t 65\t 39\t 0.8104956\t 0.8769716\t 0.8424242\n",
"I-EXPENSE\t 0\t 0\t 21\t 0.0\t 0.0\t 0.0\n",
"B-PROFIT_INCREASE\t 0\t 1\t 26\t 0.0\t 0.0\t 0.0\n",
"B-EXPENSE\t 0\t 0\t 15\t 0.0\t 0.0\t 0.0\n",
"I-PERCENTAGE\t 0\t 0\t 2\t 0.0\t 0.0\t 0.0\n",
"I-PROFIT_DECLINE\t 0\t 0\t 19\t 0.0\t 0.0\t 0.0\n",
"I-PROFIT\t 0\t 0\t 33\t 0.0\t 0.0\t 0.0\n",
"B-CURRENCY\t 210\t 2\t 0\t 0.990566\t 1.0\t 0.99526066\n",
"I-PROFIT_INCREASE\t 0\t 2\t 50\t 0.0\t 0.0\t 0.0\n",
"B-PROFIT\t 0\t 0\t 20\t 0.0\t 0.0\t 0.0\n",
"B-PERCENTAGE\t 40\t 5\t 22\t 0.8888889\t 0.6451613\t 0.7476635\n",
"I-FISCAL_YEAR\t 266\t 46\t 13\t 0.8525641\t 0.953405\t 0.90016925\n",
"B-PROFIT_DECLINE\t 0\t 0\t 9\t 0.0\t 0.0\t 0.0\n",
"B-EXPENSE_INCREASE\t 0\t 0\t 46\t 0.0\t 0.0\t 0.0\n",
"B-EXPENSE_DECREASE\t 10\t 19\t 24\t 0.3448276\t 0.29411766\t 0.31746033\n",
"B-FISCAL_YEAR\t 89\t 13\t 4\t 0.872549\t 0.9569892\t 0.9128205\n",
"I-EXPENSE_DECREASE\t 29\t 56\t 42\t 0.34117648\t 0.4084507\t 0.37179488\n",
"I-EXPENSE_INCREASE\t 0\t 1\t 96\t 0.0\t 0.0\t 0.0\n",
"tp: 1472 fp: 238 fn: 591 labels: 21\n",
"Macro-average\t prec: 0.37944978, rec: 0.3682794, f1: 0.37378114\n",
"Micro-average\t prec: 0.8608187, rec: 0.713524, f1: 0.7802809\n",
"\n",
"\n",
"Epoch 2/2 started, lr: 0.0029850747, dataset size: 1076\n",
"\n",
"\n",
"Epoch 2/2 - 444.35s - loss: 314.85 - avg training loss: 9.260294 - batches: 34\n",
"Quality on validation dataset (20.0%), validation examples = 215\n",
"time to finish evaluation: 368.89s\n",
"Total validation loss: 56.6601\tAvg validation loss: 6.2956\n",
"label\t tp\t fp\t fn\t prec\t rec\t f1\n",
"I-AMOUNT\t 149\t 4\t 6\t 0.9738562\t 0.9612903\t 0.96753246\n",
"B-AMOUNT\t 210\t 5\t 1\t 0.9767442\t 0.99526066\t 0.9859154\n",
"B-DATE\t 232\t 13\t 62\t 0.94693875\t 0.78911567\t 0.86085343\n",
"I-DATE\t 295\t 43\t 22\t 0.87278104\t 0.9305994\t 0.90076333\n",
"I-EXPENSE\t 0\t 0\t 21\t 0.0\t 0.0\t 0.0\n",
"B-PROFIT_INCREASE\t 8\t 13\t 18\t 0.3809524\t 0.30769232\t 0.34042555\n",
"B-EXPENSE\t 0\t 0\t 15\t 0.0\t 0.0\t 0.0\n",
"I-PERCENTAGE\t 0\t 0\t 2\t 0.0\t 0.0\t 0.0\n",
"I-PROFIT_DECLINE\t 0\t 0\t 19\t 0.0\t 0.0\t 0.0\n",
"I-PROFIT\t 0\t 0\t 33\t 0.0\t 0.0\t 0.0\n",
"B-CURRENCY\t 210\t 2\t 0\t 0.990566\t 1.0\t 0.99526066\n",
"I-PROFIT_INCREASE\t 29\t 35\t 21\t 0.453125\t 0.58\t 0.50877196\n",
"B-PROFIT\t 0\t 0\t 20\t 0.0\t 0.0\t 0.0\n",
"B-PERCENTAGE\t 58\t 6\t 4\t 0.90625\t 0.9354839\t 0.92063487\n",
"I-FISCAL_YEAR\t 272\t 34\t 7\t 0.8888889\t 0.9749104\t 0.9299145\n",
"B-PROFIT_DECLINE\t 0\t 0\t 9\t 0.0\t 0.0\t 0.0\n",
"B-EXPENSE_INCREASE\t 3\t 0\t 43\t 1.0\t 0.06521739\t 0.12244898\n",
"B-EXPENSE_DECREASE\t 18\t 25\t 16\t 0.41860464\t 0.5294118\t 0.4675325\n",
"B-FISCAL_YEAR\t 91\t 19\t 2\t 0.8272727\t 0.97849464\t 0.8965517\n",
"I-EXPENSE_DECREASE\t 25\t 41\t 46\t 0.37878788\t 0.35211268\t 0.36496353\n",
"I-EXPENSE_INCREASE\t 19\t 23\t 77\t 0.45238096\t 0.19791667\t 0.2753623\n",
"tp: 1619 fp: 263 fn: 444 labels: 21\n",
"Macro-average\t prec: 0.49843565, rec: 0.45702407, f1: 0.47683245\n",
"Micro-average\t prec: 0.86025506, rec: 0.7847794, f1: 0.82078576\n",
"\n"
]
}
],
"source": [
"import os\n",
"\n",
"log_files = os.listdir(\"./ner_logs\")\n",
"with open(\"./ner_logs/\"+log_files[0]) as log_file:\n",
" print(log_file.read())"
]
},
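{
"cell_type": "markdown",
"id": "prf-from-counts-sketch",
"metadata": {
"id": "prf-from-counts-sketch"
},
"source": [
"The `prec`, `rec` and `f1` columns in the log follow from the `tp`, `fp` and `fn` counts on each row. As a minimal sketch, assuming the standard definitions of precision, recall and F1, the hypothetical helper `prf` below recomputes the epoch 2 values for `B-DATE` and for the micro-average, which pools the counts of all labels.\n",
"\n",
"```python\n",
"def prf(tp, fp, fn):\n",
"    # Hypothetical helper: precision, recall and F1 from raw counts.\n",
"    # Each score falls back to 0.0 when its denominator is zero.\n",
"    prec = tp / (tp + fp) if tp + fp else 0.0\n",
"    rec = tp / (tp + fn) if tp + fn else 0.0\n",
"    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0\n",
"    return prec, rec, f1\n",
"\n",
"\n",
"# Counts copied from the epoch 2 validation log above.\n",
"b_date = prf(232, 13, 62)\n",
"micro = prf(1619, 263, 444)\n",
"print([round(x, 4) for x in b_date])\n",
"print([round(x, 4) for x in micro])\n",
"```\n",
"\n",
"Because the micro-average pools counts, frequent labels such as `DATE` and `AMOUNT` dominate it, while the macro-average is pulled down by the labels that still score 0.0 after two epochs."
]
},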
{
"cell_type": "markdown",
"id": "2TjOQ0BTEvGF",
"metadata": {
"id": "2TjOQ0BTEvGF"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "riQTP4wuQfVf",
"metadata": {
"id": "riQTP4wuQfVf"
},
"source": [
"###βοΈ Splitting Dataset Into Train and Test Set\n",
"\n",
"Also we will use `.setTestDataset('test_data_embeddings.parquet')` for checking test-loss values of each epoch in the logs file and `.useBestModel(True)` parameter whether to restore and use the model that has achieved the best performance at the end of the training.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "B35v8bF9KJhu",
"metadata": {
"id": "B35v8bF9KJhu"
},
"outputs": [],
"source": [
"! mkdir ner_logs_best"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ec0zEZhU33x8",
"metadata": {
"id": "ec0zEZhU33x8"
},
"outputs": [],
"source": [
"nerTagger = (\n",
" finance.NerApproach()\n",
" .setInputCols([\"sentence\", \"token\", \"embeddings\"])\n",
" .setLabelColumn(\"label\")\n",
" .setOutputCol(\"ner\")\n",
" .setMaxEpochs(2)\n",
" .setLr(0.002)\n",
" .setBatchSize(32)\n",
" .setRandomSeed(0)\n",
" .setVerbose(1)\n",
" .setValidationSplit(0.0)\n",
" .setEvaluationLogExtended(True)\n",
" .setEnableOutputLogs(True)\n",
" .setIncludeConfidence(True)\n",
" .setEnableMemoryOptimizer(True)\\\n",
" .setTestDataset(\"test_data_embeddings.parquet\")\n",
" .setOutputLogsPath('ner_logs_best') # if not set, logs will be written to ~/annotator_logs\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "NT_Wzmb7LVkZ",
"metadata": {
"id": "NT_Wzmb7LVkZ"
},
"outputs": [],
"source": [
"ner_pipeline = nlp.Pipeline(\n",
" stages=[\n",
" bert_embeddings,\n",
" nerTagger\n",
" ])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "NDBkjMuPKQ3I",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "NDBkjMuPKQ3I",
"outputId": "81e247d4-43f3-49c8-f453-f88585b4fd26"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 7.43 s, sys: 863 ms, total: 8.3 s\n",
"Wall time: 22min 17s\n"
]
}
],
"source": [
"%%time\n",
"ner_model = ner_pipeline.fit(train_data)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4C8bpmGdKmWW",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "4C8bpmGdKmWW",
"outputId": "cc035f2d-4173-4f40-d12c-ce88803dd58d"
},
"outputs": [
{
"data": {
"text/plain": [
"['FinanceNerApproach_72b075f0da70.log']"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"log_files = os.listdir(\"./ner_logs_best/\")\n",
"\n",
"log_files"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5UVP6bYwKov2",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "5UVP6bYwKov2",
"outputId": "6cc0f415-8f76-4d7b-99fc-9aacc65089e0"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Name of the selected graph: medical-ner-dl/blstm_38_768_128_200.pb\n",
"Training started - total epochs: 2 - lr: 0.002 - batch size: 32 - labels: 22 - chars: 98 - training examples: 1338\n",
"\n",
"\n",
"Epoch 1/2 started, lr: 0.002, dataset size: 1338\n",
"\n",
"\n",
"Epoch 1/2 - 463.01s - loss: 1153.6621 - avg training loss: 27.468145 - batches: 42\n",
"Quality on test dataset: \n",
"time to finish evaluation: 14.18s\n",
"Total test loss: 84.3584\tAvg test loss: 8.4358\n",
"label\t tp\t fp\t fn\t prec\t rec\t f1\n",
"I-AMOUNT\t 191\t 5\t 8\t 0.9744898\t 0.959799\t 0.96708864\n",
"B-AMOUNT\t 248\t 9\t 3\t 0.96498054\t 0.98804784\t 0.9763779\n",
"B-DATE\t 302\t 18\t 66\t 0.94375\t 0.8206522\t 0.87790704\n",
"I-DATE\t 423\t 24\t 51\t 0.94630873\t 0.8924051\t 0.91856676\n",
"I-EXPENSE\t 0\t 0\t 61\t 0.0\t 0.0\t 0.0\n",
"B-PROFIT_INCREASE\t 2\t 1\t 20\t 0.6666667\t 0.09090909\t 0.16000001\n",
"B-EXPENSE\t 0\t 0\t 29\t 0.0\t 0.0\t 0.0\n",
"I-PERCENTAGE\t 0\t 0\t 4\t 0.0\t 0.0\t 0.0\n",
"I-PROFIT_DECLINE\t 0\t 0\t 11\t 0.0\t 0.0\t 0.0\n",
"I-PROFIT\t 0\t 0\t 33\t 0.0\t 0.0\t 0.0\n",
"B-CURRENCY\t 249\t 5\t 0\t 0.98031497\t 1.0\t 0.9900597\n",
"I-PROFIT_INCREASE\t 2\t 0\t 44\t 1.0\t 0.04347826\t 0.083333336\n",
"B-PROFIT\t 0\t 0\t 21\t 0.0\t 0.0\t 0.0\n",
"B-PERCENTAGE\t 51\t 1\t 12\t 0.9807692\t 0.8095238\t 0.8869566\n",
"I-FISCAL_YEAR\t 270\t 28\t 17\t 0.90604025\t 0.9407666\t 0.923077\n",
"B-PROFIT_DECLINE\t 0\t 0\t 8\t 0.0\t 0.0\t 0.0\n",
"B-EXPENSE_INCREASE\t 0\t 0\t 60\t 0.0\t 0.0\t 0.0\n",
"B-EXPENSE_DECREASE\t 14\t 45\t 15\t 0.23728813\t 0.4827586\t 0.3181818\n",
"B-FISCAL_YEAR\t 90\t 8\t 6\t 0.9183673\t 0.9375\t 0.927835\n",
"I-EXPENSE_DECREASE\t 36\t 156\t 20\t 0.1875\t 0.64285713\t 0.2903226\n",
"I-EXPENSE_INCREASE\t 0\t 0\t 118\t 0.0\t 0.0\t 0.0\n",
"tp: 1878 fp: 300 fn: 607 labels: 21\n",
"Macro-average\t prec: 0.4622131, rec: 0.409938, f1: 0.4345089\n",
"Micro-average\t prec: 0.862259, rec: 0.7557344, f1: 0.80549\n",
"\n",
"\n",
"Epoch 2/2 started, lr: 0.0019900498, dataset size: 1338\n",
"\n",
"\n",
"Epoch 2/2 - 466.40s - loss: 357.03232 - avg training loss: 8.50077 - batches: 42\n",
"Quality on test dataset: \n",
"time to finish evaluation: 11.59s\n",
"Total test loss: 59.1781\tAvg test loss: 5.9178\n",
"label\t tp\t fp\t fn\t prec\t rec\t f1\n",
"I-AMOUNT\t 195\t 6\t 4\t 0.9701493\t 0.9798995\t 0.975\n",
"B-AMOUNT\t 250\t 6\t 1\t 0.9765625\t 0.99601597\t 0.9861933\n",
"B-DATE\t 348\t 31\t 20\t 0.9182058\t 0.9456522\t 0.9317269\n",
"I-DATE\t 446\t 21\t 28\t 0.9550321\t 0.9409283\t 0.9479278\n",
"I-EXPENSE\t 0\t 1\t 61\t 0.0\t 0.0\t 0.0\n",
"B-PROFIT_INCREASE\t 14\t 12\t 8\t 0.53846157\t 0.6363636\t 0.5833334\n",
"B-EXPENSE\t 0\t 0\t 29\t 0.0\t 0.0\t 0.0\n",
"I-PERCENTAGE\t 0\t 0\t 4\t 0.0\t 0.0\t 0.0\n",
"I-PROFIT_DECLINE\t 2\t 3\t 9\t 0.4\t 0.18181819\t 0.25\n",
"I-PROFIT\t 0\t 1\t 33\t 0.0\t 0.0\t 0.0\n",
"B-CURRENCY\t 249\t 5\t 0\t 0.98031497\t 1.0\t 0.9900597\n",
"I-PROFIT_INCREASE\t 31\t 36\t 15\t 0.46268657\t 0.67391306\t 0.54867256\n",
"B-PROFIT\t 0\t 0\t 21\t 0.0\t 0.0\t 0.0\n",
"B-PERCENTAGE\t 62\t 3\t 1\t 0.95384616\t 0.984127\t 0.96875\n",
"I-FISCAL_YEAR\t 279\t 12\t 8\t 0.9587629\t 0.9721254\t 0.9653979\n",
"B-PROFIT_DECLINE\t 0\t 0\t 8\t 0.0\t 0.0\t 0.0\n",
"B-EXPENSE_INCREASE\t 1\t 0\t 59\t 1.0\t 0.016666668\t 0.032786887\n",
"B-EXPENSE_DECREASE\t 20\t 63\t 9\t 0.24096386\t 0.6896552\t 0.35714287\n",
"B-FISCAL_YEAR\t 89\t 6\t 7\t 0.9368421\t 0.9270833\t 0.93193716\n",
"I-EXPENSE_DECREASE\t 43\t 125\t 13\t 0.2559524\t 0.76785713\t 0.38392857\n",
"I-EXPENSE_INCREASE\t 0\t 0\t 118\t 0.0\t 0.0\t 0.0\n",
"tp: 2029 fp: 331 fn: 456 labels: 21\n",
"Macro-average\t prec: 0.5022753, rec: 0.51010025, f1: 0.5061575\n",
"Micro-average\t prec: 0.85974574, rec: 0.816499, f1: 0.8375645\n",
"\n"
]
}
],
"source": [
"with open(\"./ner_logs_best/\"+log_files[0]) as log_file:\n",
" print(log_file.read())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "w9g-vbvhfntQ",
"metadata": {
"id": "w9g-vbvhfntQ"
},
"outputs": [],
"source": [
"# test_data = bert_embeddings.transform(test_data)\n",
"\n",
"predictions = ner_model.transform(test_data)\n",
"\n",
"from sklearn.metrics import classification_report\n",
"\n",
"preds_df = predictions.select(F.explode(F.arrays_zip(predictions.token.result,\n",
" predictions.label.result,\n",
" predictions.ner.result)).alias(\"cols\")) \\\n",
" .select(F.expr(\"cols['0']\").alias(\"token\"),\n",
" F.expr(\"cols['1']\").alias(\"ground_truth\"),\n",
" F.expr(\"cols['2']\").alias(\"prediction\")).toPandas()\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "wL4Cqq-uzRhg",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "wL4Cqq-uzRhg",
"outputId": "cbb7001f-2bf8-42eb-c52a-2bb48248a043"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" B-AMOUNT 0.9766 0.9960 0.9862 251\n",
" B-CURRENCY 0.9803 1.0000 0.9901 249\n",
" B-DATE 0.9182 0.9457 0.9317 368\n",
" B-EXPENSE 0.0000 0.0000 0.0000 29\n",
"B-EXPENSE_DECREASE 0.2410 0.6897 0.3571 29\n",
"B-EXPENSE_INCREASE 1.0000 0.0167 0.0328 60\n",
" B-FISCAL_YEAR 0.9368 0.9271 0.9319 96\n",
" B-PERCENTAGE 0.9538 0.9841 0.9688 63\n",
" B-PROFIT 0.0000 0.0000 0.0000 21\n",
" B-PROFIT_DECLINE 0.0000 0.0000 0.0000 8\n",
" B-PROFIT_INCREASE 0.5385 0.6364 0.5833 22\n",
" I-AMOUNT 0.9701 0.9799 0.9750 199\n",
" I-DATE 0.9550 0.9409 0.9479 474\n",
" I-EXPENSE 0.0000 0.0000 0.0000 61\n",
"I-EXPENSE_DECREASE 0.2560 0.7679 0.3839 56\n",
"I-EXPENSE_INCREASE 0.0000 0.0000 0.0000 118\n",
" I-FISCAL_YEAR 0.9588 0.9721 0.9654 287\n",
" I-PERCENTAGE 0.0000 0.0000 0.0000 4\n",
" I-PROFIT 0.0000 0.0000 0.0000 33\n",
" I-PROFIT_DECLINE 0.4000 0.1818 0.2500 11\n",
" I-PROFIT_INCREASE 0.4627 0.6739 0.5487 46\n",
" O 0.9760 0.9888 0.9824 9552\n",
"\n",
" accuracy 0.9532 12037\n",
" macro avg 0.5238 0.5319 0.4925 12037\n",
" weighted avg 0.9421 0.9532 0.9443 12037\n",
"\n"
]
}
],
"source": [
"print(classification_report(preds_df['ground_truth'], preds_df['prediction'], digits=4))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "uw5HgQ_FMzwj",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "uw5HgQ_FMzwj",
"outputId": "8dcb8567-d169-413c-d3f7-a844b929b4db"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------+------------------+----------+\n",
"|token |ground_truth |prediction|\n",
"+------------+------------------+----------+\n",
"|$ |B-CURRENCY |B-CURRENCY|\n",
"|2.6 |B-AMOUNT |B-AMOUNT |\n",
"|million |I-AMOUNT |I-AMOUNT |\n",
"|of |O |O |\n",
"|the |O |O |\n",
"|increase |O |O |\n",
"|was |O |O |\n",
"|attributable|O |O |\n",
"|to |O |O |\n",
"|our |O |O |\n",
"|increased |O |O |\n",
"|hosting |B-EXPENSE_INCREASE|O |\n",
"|costs |I-EXPENSE_INCREASE|O |\n",
"|largely |O |O |\n",
"|associated |O |O |\n",
"|with |O |O |\n",
"|the |O |O |\n",
"|increased |O |O |\n",
"|adoption |O |O |\n",
"|of |O |O |\n",
"+------------+------------------+----------+\n",
"only showing top 20 rows\n",
"\n"
]
}
],
"source": [
"from sklearn.metrics import classification_report\n",
"\n",
"predictions.select(F.explode(F.arrays_zip(predictions.token.result,\n",
" predictions.label.result,\n",
" predictions.ner.result)).alias(\"cols\")) \\\n",
" .select(F.expr(\"cols['0']\").alias(\"token\"),\n",
" F.expr(\"cols['1']\").alias(\"ground_truth\"),\n",
" F.expr(\"cols['2']\").alias(\"prediction\")).show(truncate=False)\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "P0sD2CU4HP-H",
"metadata": {
"id": "P0sD2CU4HP-H"
},
"source": [
"###βοΈ Entity level evaluation (strict eval)"
]
},
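{
"cell_type": "markdown",
"id": "strict-eval-sketch",
"metadata": {
"id": "strict-eval-sketch"
},
"source": [
"Strict entity-level evaluation counts a predicted entity as correct only when both its type and its exact token boundaries match the annotation. The following is a minimal sketch of that idea in plain Python, assuming BIO tags. `bio_spans` and `strict_scores` are hypothetical helpers written for illustration; they are not the implementation of the `conll_eval.py` script used in this notebook.\n",
"\n",
"```python\n",
"def bio_spans(tags):\n",
"    # Hypothetical helper: collect (start, end, type) spans from BIO tags.\n",
"    # The end index is exclusive; a trailing 'O' sentinel closes any open span.\n",
"    spans, start, kind = set(), None, None\n",
"    for i, tag in enumerate(list(tags) + ['O']):\n",
"        inside = tag.startswith('I-') and kind == tag[2:]\n",
"        if start is not None and not inside:\n",
"            spans.add((start, i, kind))\n",
"            start, kind = None, None\n",
"        if tag.startswith('B-') or (tag.startswith('I-') and start is None):\n",
"            start, kind = i, tag[2:]\n",
"    return spans\n",
"\n",
"\n",
"def strict_scores(gold, pred):\n",
"    # An entity is correct only if boundaries and type match exactly.\n",
"    gold_spans, pred_spans = bio_spans(gold), bio_spans(pred)\n",
"    correct = len(gold_spans & pred_spans)\n",
"    precision = correct / len(pred_spans) if pred_spans else 0.0\n",
"    recall = correct / len(gold_spans) if gold_spans else 0.0\n",
"    return correct, precision, recall\n",
"\n",
"\n",
"# Illustrative tags: the expense entity is annotated but not predicted.\n",
"gold = ['B-CURRENCY', 'B-AMOUNT', 'I-AMOUNT', 'O', 'B-EXPENSE_INCREASE', 'I-EXPENSE_INCREASE']\n",
"pred = ['B-CURRENCY', 'B-AMOUNT', 'I-AMOUNT', 'O', 'O', 'O']\n",
"correct, precision, recall = strict_scores(gold, pred)\n",
"print(correct, precision, recall)\n",
"```\n",
"\n",
"Entity-level scores are usually lower than token-level ones, because a single wrong tag inside a multi-token entity makes the whole entity count as an error."
]
},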
{
"cell_type": "code",
"execution_count": null,
"id": "b6zcBUVcBe5y",
"metadata": {
"id": "b6zcBUVcBe5y"
},
"outputs": [],
"source": [
"!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/open-source-nlp/utils/conll_eval.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "jpiIrbx5I8qI",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "jpiIrbx5I8qI",
"outputId": "9fb83261-9846-45b2-d6b9-ff4a2b582070"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"processed 12037 tokens with 1196 phrases; found: 1204 phrases; correct: 1021.\n",
"accuracy: 81.65%; (non-O)\n",
"accuracy: 95.32%; precision: 84.80%; recall: 85.37%; FB1: 85.08\n",
" AMOUNT: precision: 94.25%; recall: 98.01%; FB1: 96.09 261\n",
" CURRENCY: precision: 98.03%; recall: 100.00%; FB1: 99.01 254\n",
" DATE: precision: 89.32%; recall: 93.21%; FB1: 91.22 384\n",
" EXPENSE: precision: 0.00%; recall: 0.00%; FB1: 0.00 1\n",
" EXPENSE_DECREASE: precision: 20.65%; recall: 65.52%; FB1: 31.40 92\n",
" EXPENSE_INCREASE: precision: 0.00%; recall: 0.00%; FB1: 0.00 1\n",
" FISCAL_YEAR: precision: 89.90%; recall: 92.71%; FB1: 91.28 99\n",
" PERCENTAGE: precision: 93.85%; recall: 96.83%; FB1: 95.31 65\n",
" PROFIT: precision: 0.00%; recall: 0.00%; FB1: 0.00 1\n",
" PROFIT_DECLINE: precision: 0.00%; recall: 0.00%; FB1: 0.00 4\n",
" PROFIT_INCREASE: precision: 33.33%; recall: 63.64%; FB1: 43.75 42\n"
]
}
],
"source": [
"import conll_eval\n",
"\n",
"metrics = conll_eval.evaluate(preds_df['ground_truth'].values, preds_df['prediction'].values)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "vZ0jme54KDyC",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "vZ0jme54KDyC",
"outputId": "8b0227a0-51cf-4cdc-ad05-607cd91fc2c9"
},
"outputs": [
{
"data": {
"text/plain": [
"(84.80066445182725, 85.36789297658864, 85.08333333333334)"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Overall (precision, recall, F1) across all entities; the same values as the summary line printed by conll_eval above\n",
"metrics[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "YMZu0ottJkmn",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 394
},
"id": "YMZu0ottJkmn",
"outputId": "658d923b-b8a3-435f-8e91-77b500997e34"
},
"outputs": [
{
"data": {
"text/plain": [
" entity precision recall f1 support\n",
"0 AMOUNT 94.252874 98.007968 96.093750 261\n",
"1 CURRENCY 98.031496 100.000000 99.005964 254\n",
"2 DATE 89.322917 93.206522 91.223404 384\n",
"3 EXPENSE 0.000000 0.000000 0.000000 1\n",
"4 EXPENSE_DECREASE 20.652174 65.517241 31.404959 92\n",
"5 EXPENSE_INCREASE 0.000000 0.000000 0.000000 1\n",
"6 FISCAL_YEAR 89.898990 92.708333 91.282051 99\n",
"7 PERCENTAGE 93.846154 96.825397 95.312500 65\n",
"8 PROFIT 0.000000 0.000000 0.000000 1\n",
"9 PROFIT_DECLINE 0.000000 0.000000 0.000000 4\n",
"10 PROFIT_INCREASE 33.333333 63.636364 43.750000 42"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"pd.DataFrame(metrics[1], columns=['entity','precision','recall','f1','support'])"
]
},
{
"cell_type": "markdown",
"id": "DVBxVC2yi12r",
"metadata": {
"id": "DVBxVC2yi12r"
},
"source": [
"### NER log parser"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cyKawgE8i4TN",
"metadata": {
"id": "cyKawgE8i4TN"
},
"outputs": [],
"source": [
"!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/open-source-nlp/utils/ner_log_parser.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "hQ2tEbyRjC0E",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"id": "hQ2tEbyRjC0E",
"outputId": "d932d50c-1327-456d-b0c9-db3110e00e1b"
},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
},
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"import ner_log_parser\n",
"\n",
"%matplotlib inline\n",
"\n",
"ner_log_parser.get_charts('./ner_logs_best/'+log_files[0])"
]
},
{
"cell_type": "markdown",
"id": "GcQKMIYI3h4o",
"metadata": {
"id": "GcQKMIYI3h4o"
},
"source": [
"**Plotting Loss**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3RTrm5EU3OWb",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 513
},
"id": "3RTrm5EU3OWb",
"outputId": "6eda6e02-5dae-4861-dba4-41e1538b2cbb"
},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"ner_log_parser.loss_plot('./ner_logs_best/'+log_files[0])"
]
},
{
"cell_type": "markdown",
"id": "WuJ5YZ9sXU13",
"metadata": {
"id": "WuJ5YZ9sXU13"
},
"source": [
"### 💾 Saving the trained model"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "KBcoOwvwXV8p",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "KBcoOwvwXV8p",
"outputId": "0cf195ff-28cb-4ad8-e83a-8e98eec3b5e2"
},
"outputs": [
{
"data": {
"text/plain": [
"[BERT_EMBEDDINGS_29ce72cd673e, FinanceNerModel_80baf3edad7a]"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ner_model.stages"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "gRLbhTh1XYo2",
"metadata": {
"id": "gRLbhTh1XYo2"
},
"outputs": [],
"source": [
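"# Save only the trained NER stage (stages[1]); stages[0] is the pretrained BERT embeddings stage\n",
"ner_model.stages[1].write().overwrite().save('NER_bert_e2_b32')"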
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "k6B_m0HeXhvo",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "k6B_m0HeXhvo",
"outputId": "2495cafa-8974-4759-825e-63c151ef6ead"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"total 1052\n",
"drwxr-xr-x 4 root root 4096 Jan 23 20:38 NER_bert_e2_b32\n",
"drwxr-xr-x 2 root root 4096 Jan 23 20:38 __pycache__\n",
"-rw-r--r-- 1 root root 3826 Jan 23 20:38 ner_log_parser.py\n",
"-rw-r--r-- 1 root root 7431 Jan 23 20:38 conll_eval.py\n",
"drwxr-xr-x 2 root root 4096 Jan 23 20:20 ner_logs_best\n",
"drwxr-xr-x 2 root root 4096 Jan 23 19:47 ner_logs\n",
"drwxr-xr-x 2 root root 4096 Jan 23 19:37 test_data_embeddings.parquet\n",
"-rw-r--r-- 1 root root 1033219 Jan 23 19:05 conll_noO.conll\n",
"-rw-r--r-- 1 root root 1785 Jan 23 19:01 'spark_nlp_for_healthcare_spark_ocr_7162 (4).json'\n",
"drwxr-xr-x 1 root root 4096 Jan 20 14:35 sample_data\n"
]
}
],
"source": [
"!ls -lt"
]
},
{
"cell_type": "markdown",
"id": "gK0rbohHRNmG",
"metadata": {
"id": "gK0rbohHRNmG"
},
"source": [
"### Prediction Pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "HkB6TUhpMFvB",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "HkB6TUhpMFvB",
"outputId": "d556df94-b888-4925-8448-6a149978a43c"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"bert_embeddings_sec_bert_base download started this may take some time.\n",
"Approximate size to download 390.4 MB\n",
"[OK!]\n"
]
}
],
"source": [
"document = nlp.DocumentAssembler()\\\n",
" .setInputCol(\"text\")\\\n",
" .setOutputCol(\"document\")\n",
"\n",
"text_splitter = finance.TextSplitter()\\\n",
" .setInputCols(['document'])\\\n",
" .setOutputCol('sentence')\n",
"\n",
"token = nlp.Tokenizer()\\\n",
" .setInputCols(['sentence'])\\\n",
" .setOutputCol('token')\n",
"\n",
"bert_embeddings = nlp.BertEmbeddings.pretrained(\"bert_embeddings_sec_bert_base\", \"en\") \\\n",
" .setInputCols(\"sentence\", \"token\") \\\n",
" .setOutputCol(\"embeddings\")\\\n",
" .setMaxSentenceLength(512)\n",
" \n",
"# load trained model\n",
"loaded_ner_model = finance.NerModel.load(\"NER_bert_e2_b32\")\\\n",
" .setInputCols([\"sentence\", \"token\", \"embeddings\"])\\\n",
" .setOutputCol(\"ner\")\n",
"\n",
"converter = finance.NerConverterInternal()\\\n",
" .setInputCols([\"document\", \"token\", \"ner\"])\\\n",
" .setOutputCol(\"ner_span\")\n",
"\n",
"ner_prediction_pipeline = nlp.Pipeline(\n",
" stages = [\n",
" document,\n",
" text_splitter,\n",
" token,\n",
" bert_embeddings,\n",
" loaded_ner_model,\n",
" converter\n",
" ])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "gsokdubdX1vE",
"metadata": {
"id": "gsokdubdX1vE"
},
"outputs": [],
"source": [
"empty_data = spark.createDataFrame([['']]).toDF(\"text\")\n",
"\n",
"prediction_model = ner_prediction_pipeline.fit(empty_data)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "rR8b0tQlX7E8",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "rR8b0tQlX7E8",
"outputId": "08d23380-97d0-4fda-b0e4-0cbadfed585f"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n",
"|text |\n",
"+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n",
"|$ 4.2 million of the increase was compensation related and primarily attributable to an increase in headcount to support the continued growth of our subscription SaaS offerings and ongoing maintenance and support for our expanding customer base .|\n",
"+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n",
"\n"
]
}
],
"source": [
"text = \"\"\"$ 4.2 million of the increase was compensation related and primarily attributable to an increase in headcount to support the continued growth of our subscription SaaS offerings and ongoing maintenance and support for our expanding customer base .\"\"\"\n",
"\n",
"sample_data = spark.createDataFrame([[text]]).toDF(\"text\")\n",
"\n",
"sample_data.show(truncate=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "WdeKg30uX_rk",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "WdeKg30uX_rk",
"outputId": "0df2ca92-185e-4bab-821a-9a9f2e65c938"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----------+--------+\n",
"|chunk |entity |\n",
"+-----------+--------+\n",
"|$ |CURRENCY|\n",
"|4.2 million|AMOUNT |\n",
"+-----------+--------+\n",
"\n"
]
}
],
"source": [
"preds = prediction_model.transform(sample_data)\n",
"\n",
"preds.select(F.explode(F.arrays_zip(preds.ner_span.result,\n",
" preds.ner_span.metadata)).alias(\"entities\")) \\\n",
" .select(F.expr(\"entities['0']\").alias(\"chunk\"),\n",
" F.expr(\"entities['1'].entity\").alias(\"entity\")).show(truncate=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "UG6eBHQZYIhb",
"metadata": {
"id": "UG6eBHQZYIhb"
},
"outputs": [],
"source": [
"light_model = nlp.LightPipeline(prediction_model)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7Hr5gtKbYLOW",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "7Hr5gtKbYLOW",
"outputId": "316be9e6-9656-47e2-aecb-5327398b75b1"
},
"outputs": [
{
"data": {
"text/plain": [
"[('$', 'B-CURRENCY'),\n",
" ('4.2', 'B-AMOUNT'),\n",
" ('million', 'I-AMOUNT'),\n",
" ('of', 'O'),\n",
" ('the', 'O'),\n",
" ('increase', 'O'),\n",
" ('was', 'O'),\n",
" ('compensation', 'O'),\n",
" ('related', 'O'),\n",
" ('and', 'O'),\n",
" ('primarily', 'O'),\n",
" ('attributable', 'O'),\n",
" ('to', 'O'),\n",
" ('an', 'O'),\n",
" ('increase', 'O'),\n",
" ('in', 'O'),\n",
" ('headcount', 'O'),\n",
" ('to', 'O'),\n",
" ('support', 'O'),\n",
" ('the', 'O'),\n",
" ('continued', 'O'),\n",
" ('growth', 'O'),\n",
" ('of', 'O'),\n",
" ('our', 'O'),\n",
" ('subscription', 'O'),\n",
" ('SaaS', 'O'),\n",
" ('offerings', 'O'),\n",
" ('and', 'O'),\n",
" ('ongoing', 'O'),\n",
" ('maintenance', 'O'),\n",
" ('and', 'O'),\n",
" ('support', 'O'),\n",
" ('for', 'O'),\n",
" ('our', 'O'),\n",
" ('expanding', 'O'),\n",
" ('customer', 'O'),\n",
" ('base', 'O'),\n",
" ('.', 'O')]"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text = \"\"\"$ 4.2 million of the increase was compensation related and primarily attributable to an increase in headcount to support the continued growth of our subscription SaaS offerings and ongoing maintenance and support for our expanding customer base .\"\"\"\n",
"\n",
"result_ann = light_model.annotate(text)\n",
"\n",
"list(zip(result_ann['token'], result_ann['ner']))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "OO4xKzhIZEDc",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"id": "OO4xKzhIZEDc",
"outputId": "7a3dcf63-c165-4637-e9e7-52bcd2ef4ea6"
},
"outputs": [
{
"data": {
"text/plain": [
" sent_id token start end ner\n",
"0 0 $ 0 0 B-CURRENCY\n",
"1 0 4.2 2 4 B-AMOUNT\n",
"2 0 million 6 12 I-AMOUNT\n",
"3 0 of 14 15 O\n",
"4 0 the 17 19 O\n",
"5 0 increase 21 28 O\n",
"6 0 was 30 32 O\n",
"7 0 compensation 34 45 O\n",
"8 0 related 47 53 O\n",
"9 0 and 55 57 O\n",
"10 0 primarily 59 67 O\n",
"11 0 attributable 69 80 O\n",
"12 0 to 82 83 O\n",
"13 0 an 85 86 O\n",
"14 0 increase 88 95 O\n",
"15 0 in 97 98 O\n",
"16 0 headcount 100 108 O\n",
"17 0 to 110 111 O\n",
"18 0 support 113 119 O\n",
"19 0 the 121 123 O\n",
"20 0 continued 125 133 O\n",
"21 0 growth 135 140 O\n",
"22 0 of 142 143 O\n",
"23 0 our 145 147 O\n",
"24 0 subscription 149 160 O\n",
"25 0 SaaS 162 165 O\n",
"26 0 offerings 167 175 O\n",
"27 0 and 177 179 O\n",
"28 0 ongoing 181 187 O\n",
"29 0 maintenance 189 199 O\n",
"30 0 and 201 203 O\n",
"31 0 support 205 211 O\n",
"32 0 for 213 215 O\n",
"33 0 our 217 219 O\n",
"34 0 expanding 221 229 O\n",
"35 0 customer 231 238 O\n",
"36 0 base 240 243 O\n",
"37 0 . 245 245 O"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"result = light_model.fullAnnotate(text)\n",
"\n",
"ner_df = pd.DataFrame(\n",
"    [(int(x.metadata['sentence']), x.result, x.begin, x.end, y.result)\n",
"     for x, y in zip(result[0][\"token\"], result[0][\"ner\"])],\n",
"    columns=['sent_id', 'token', 'start', 'end', 'ner'])\n",
"ner_df"
]
},
{
"cell_type": "markdown",
"id": "xAdLlcMjejMm",
"metadata": {
"id": "xAdLlcMjejMm"
},
"source": [
"### **Highlight Entities**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fbx496QFQydD",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 112
},
"id": "fbx496QFQydD",
"outputId": "65bc0b3d-187c-4d13-9175-5deee99a568e"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
" $ CURRENCY 4.2 million AMOUNT of the increase was compensation related and primarily attributable to an increase in headcount to support the continued growth of our subscription SaaS offerings and ongoing maintenance and support for our expanding customer base ."
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
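"# Visualise the first (and only) annotated document without rebinding `result`,\n",
"# so that this cell can be re-run safely\n",
"visualiser = nlp.viz.NerVisualizer()\n",
"visualiser.display(result[0], label_col='ner_span', document_col='document')"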
]
}
],
"metadata": {
"colab": {
"provenance": [],
"toc_visible": true
},
"gpuClass": "standard",
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}