{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": false, "id": "V7YJ6M4i2jX1" }, "source": [ "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "id": "3GkVa-4c2jX3" }, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/08.3.Finetuning_Table_Question_Answering.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "id": "s8H-JKoXQoBc" }, "source": [ "#🔎 Finetuning TAPAS (strong supervision)\n", "In this notebook we show how to finetune TAPAS with the `transformers` library and then import the resulting model into Spark NLP.\n", "\n", "**IMPORTANT: This is just an example; JSL does not provide support for errors or questions about the transformers library.**\n" ] }, { "cell_type": "markdown", "metadata": { "id": "nZdHmYzfHjla" }, "source": [ "\n", "## Strong supervision\n", "Strong supervision is one of the approaches to Table Question Answering. At training time, each question-answer pair is labeled with an aggregation operator (COUNT, AVERAGE, SUM, ...) as well as with the cells that answer the question.\n", "\n", "For example, if you want the SUM of the column \"revenue\":\n", "- The question will be: \"What is the total (or sum, or any other synonym) revenue?\"\n", "- The answer will be:\n", "  - Cells: all the cells in the column \"revenue\".\n", "  - The aggregator \"SUM\".\n" ] }, { "cell_type": "markdown", "metadata": { "id": "FhOrdAIsHliy" }, "source": [ "We are going to finetune `\"google/tapas-base\"` with the following aggregation operations:\n", "- NONE (just for retrieving cell(s) content)\n", "- SUM\n", "- AVERAGE\n", "- COUNT\n", "- MAX\n", "- MIN\n", "\n", "Feel free to experiment with any other operator!" ] }, { "cell_type": "markdown", "metadata": { "id": "3aacceac-af11-4ac0-b1e4-f9710b94c6f4" }, "source": [ "##🔎 Finetuning TAPAS models in HuggingFace 🤗 \n", "\n", "Let's keep in mind a few things before we start 😊 \n", "\n", "- This feature is only available in `Spark NLP 4.2.0` and above, so please make sure you have upgraded to the latest Spark NLP release.\n", "- You can import TAPAS models from HuggingFace, but they have to be compatible with `TensorFlow` and they have to be in the `Table Question Answering` category." ] }, { "cell_type": "markdown", "metadata": { "id": "TKdgnvAct9Dy" }, "source": [ "#🎬 Installing requirements" ] }, { "cell_type": "markdown", "metadata": { "id": "8955d73d-7d22-4c6b-936a-89963e02b551" }, "source": [ "- Let's install `HuggingFace` and `TensorFlow`. You don't need `TensorFlow` to be installed for Spark NLP; however, we need it to load and save models from HuggingFace.\n", "- We pin TensorFlow to version `2.10.0` and Transformers to `4.22.1`. 
This doesn't mean it won't work with future releases, but these are the versions that have been tested successfully.\n", "- TAPAS additionally depends on the `torch-scatter` library, which we install in the next cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "c_eizj5W4UAN" }, "outputs": [], "source": [ "%pip install tensorflow==2.10.0 transformers==4.22.1 datasets" ] }, { "cell_type": "markdown", "metadata": { "id": "otdmLjPtSsN5" }, "source": [ "#📌 We need torch-scatter as well, as a dependency of TAPAS" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dZ6f0gMFgT9p" }, "outputs": [], "source": [ "! pip install torch-scatter -f https://data.pyg.org/whl/torch-1.12.0+cu112.html" ] }, { "cell_type": "markdown", "metadata": { "id": "082cd4c2-3e59-4a1c-a7b0-fba791b6b39d" }, "source": [ "##📌 Loading from Hugging Face" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 104, "referenced_widgets": [ "8f88821673844721a988714cb4cc4b0b", "e5b7be2ee2af40438e6103b126d62a13", "5f5e6568c5c14e3e9fc0cb602dcca524", "ab171a0a4d6b45f0bf5b2afde61d2a0d", "b7fc543eee094a33aca45e77ca9e757a", "e49f036193854b1fbd2126578be5d39f", "31e3393e0d3e465d82dcff62b477c9ae", "e4fea04935234ff3a789052f9496029e", "6e9a664c7ace4030ba7a3364212573c8", "4e3e8c07211a4f12b4f86c620447f9da", "651f654c99924d87a3ccef7f57eb62c2" ] }, "id": "05745b4f-710e-4a64-a6e9-6997e8924edf", "outputId": "be2caab4-52e4-4059-a552-eecd29e72746" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Moving 0 files to the new cache system\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "8f88821673844721a988714cb4cc4b0b", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from transformers import TapasTokenizer, TapasForQuestionAnswering, TapasConfig\n", "import pandas as pd\n", "import tensorflow as tf\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "a4b4e4a2-ad6a-4a72-b96b-86a0f9f060c2" }, "outputs": [], "source": [ "MODEL_NAME = \"google/tapas-base\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "630c18da-7d6f-4270-8269-05ffb2c96e48", "outputId": "909e5251-9d25-4cd1-cf64-ae21fd7bc555" }, "outputs": [ { "data": { "text/plain": [ "{'NONE': 0, 'SUM': 1, 'AVERAGE': 2, 'COUNT': 3, 'MAX': 4, 'MIN': 5}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aggregation_labels = {\n", "    0: \"NONE\",\n", "    1: \"SUM\",\n", "    2: \"AVERAGE\",\n", "    3: \"COUNT\",\n", "    4: \"MAX\",\n", "    5: \"MIN\"\n", "}\n", "\n", "lab_dict = {y:x for x,y in aggregation_labels.items()}\n", "lab_dict" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "FP9j94CcH1Zq", "outputId": "dc043483-18ef-4ee9-9a17-8610977afef9" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Some weights of TapasForQuestionAnswering were not initialized from the model checkpoint at 
google/tapas-base and are newly initialized: ['column_output_weights', 'output_bias', 'aggregation_classifier.weight', 'output_weights', 'aggregation_classifier.bias', 'column_output_bias']\n", "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n" ] } ], "source": [ "original_config = TapasConfig.from_pretrained(MODEL_NAME)\n", "# Let's first load the HF model\n", "config = TapasConfig(num_aggregation_labels=len(aggregation_labels),\n", "                     use_answer_as_supervision = False, # for strong supervision, set use_answer_as_supervision to False (the ground-truth aggregation label is given during training)\n", "                     cell_selection_preference = original_config.cell_selection_preference,\n", "                     aggregation_labels = aggregation_labels)\n", "\n", "tokenizer = TapasTokenizer.from_pretrained(MODEL_NAME)\n", "model = TapasForQuestionAnswering.from_pretrained(MODEL_NAME, config=config)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "YQCh0Ey9OlLP", "outputId": "85280479-a567-49a2-acbb-044833286ed1" }, "outputs": [ { "data": { "text/plain": [ "{0: 'NONE', 1: 'SUM', 2: 'AVERAGE', 3: 'COUNT', 4: 'MAX', 5: 'MIN'}" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.config.aggregation_labels" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "5GTMkYUpGGah", "outputId": "246caf1d-204c-47e7-c661-f7310b8c1dae" }, "outputs": [ { "data": { "text/plain": [ "{'NONE': 0, 'SUM': 1, 'AVERAGE': 2, 'COUNT': 3, 'MAX': 4, 'MIN': 5}" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lab_dict = {y:x for x,y in model.config.aggregation_labels.items()}\n", "lab_dict" ] }, { "cell_type": "markdown", "metadata": { "id": "Q18aVRv0goMl" }, "source": [ "#📌 Datasets" ] }, { "cell_type": "markdown", "metadata": { "id": "sBpI3xSZTx5I" }, "source": [ "📜 To create the dataset, we follow these steps:\n", "1. We create a list of dataframes.\n", "2. For each dataframe, we generate questions together with their aggregation operators.\n", "3. We compute the answer to each question using Pandas (a minimal illustration follows below).\n", "\n", "The dataframes + the questions + the answers will be our training dataset."
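 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For intuition, here is a minimal, hypothetical illustration of one such training pair, computed with Pandas on a toy table (the table and column names below are made up for illustration; they are not part of the dataset we download next):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "toy = pd.DataFrame({\"product\": [\"A\", \"B\"], \"revenue\": [\"10\", \"30\"]})\n", "# Strong supervision labels the question with an aggregator plus the answer cells:\n", "cells = [(i, 1) for i in range(len(toy))]               # (row, column) coordinates of \"revenue\"\n", "cell_texts = [toy.iloc[i, 1] for i in range(len(toy))]  # the cell texts themselves\n", "float_answer = sum(float(x) for x in cell_texts)        # 40.0; only needed for weak supervision\n", "(\"what is the total revenue\", \"SUM\", cells, cell_texts, float_answer)"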
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qMKDcqVuUjog" }, "outputs": [], "source": [ "!wget -O tapas_example.pkl.bz2 https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/data/tapas_example.pkl.bz2?raw=true" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "JGzraqn7aIrD", "outputId": "8f2ab523-c686-4120-8f9c-f71f294ab20e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loaded 5150 dataframes from datasource `tapas_example_by_JSL`\n" ] } ], "source": [ "import bz2, pickle\n", "datasource = \"tapas_example_by_JSL\"\n", "\n", "ifile = bz2.BZ2File(\"tapas_example.pkl.bz2\",'rb')\n", "list_of_df = pickle.load(ifile)\n", "print(f\"Loaded {len(list_of_df)} dataframes from datasource `{datasource}`\")\n", "ifile.close()" ] }, { "cell_type": "markdown", "metadata": { "id": "CMIGCiSIqjfn" }, "source": [ "#🔎 Configuration params" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "n2Df8cL5IHdB" }, "outputs": [], "source": [ "MAX_DFS_PER_DATASOURCE = 2 # How many dataframes to use from list_of_df. Question generation needs many resources, so we keep this low for a quick demonstration.\n", "MAX_QUESTIONS_PER_DF = None # To add all questions, leave it as None. To randomly select n questions, set it to a number.\n", "\n", "EPOCHS = 8 # Number of times to go over each dataframe\n", "BATCH_SIZE = 16 # Number of questions to process per batch. In Colab, more than 16 will trigger CUDA out of memory." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "4JD9F7AYIw2C", "outputId": "dceb7c66-521f-4abe-bb05-3c272802371e" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.7/dist-packages/transformers/optimization.py:310: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning\n", "  FutureWarning,\n" ] } ], "source": [ "import torch\n", "from transformers import AdamW\n", "\n", "# GPU training\n", "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", "model.to(device)\n", "\n", "# Adam optimizer\n", "LR = 1e-5\n", "optimizer = AdamW(model.parameters(), lr=LR)" ] }, { "cell_type": "markdown", "metadata": { "id": "vJFOmg_iPY69" }, "source": [ "##✔️ Functions to generate questions synthetically" ] }, { "cell_type": "markdown", "metadata": { "id": "gWoe9ajaJFFy" }, "source": [ "###✔️ Generate questions with NO aggregator, just the cell(s)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UpztzwwzyK4C" }, "outputs": [], "source": [ "import pickle, random\n", "\n", "def generate_NONE(datasource, list_of_df, max_dfs_per_datasource=MAX_DFS_PER_DATASOURCE, max_questions_per_df=MAX_QUESTIONS_PER_DF):\n", "\n", "    # The question dataframe contains the following columns:\n", "    # - A dummy ID just to identify what type of question it is. It's ignored.\n", "    # - The question text\n", "    # - The aggregation string: NONE, COUNT, AVERAGE, SUM, MAX, MIN\n", "    # - If the aggregation is not NONE, the FLOAT result of the operation.
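 For NONE questions this stays np.nan.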
\n", "    # - A list of tuples, where each tuple is a cell answering the question\n", "    # - A list of texts, where each text is the answer to the question\n", "\n", "    questions_df = []\n", "    for counter, df in enumerate(list_of_df):\n", "        if max_dfs_per_datasource is not None and counter == max_dfs_per_datasource:\n", "            break\n", "        questions = []\n", "        headers = df.columns\n", "        # CELL IN FIRST ROW\n", "        questions = [\n", "            (0, f\"what is the first {x}\", \"NONE\", np.nan, [(0,i)], [df.iloc[0,i]]) for i, x in enumerate(headers)\n", "        ]\n", "\n", "        # CELL IN LAST ROW\n", "        questions.extend([\n", "            (1, f\"what is the last {x}\", \"NONE\", np.nan, [(len(df)-1,i)], [df.iloc[len(df)-1,i]]) for i, x in enumerate(headers)\n", "        ])\n", "\n", "        # FIRST ROW\n", "        # TAPAS answer coordinates are 0-based over the data rows, which matches df.iloc:\n", "        # row 0 is the first data row, not the column header.\n", "        questions.append((2, \"what is the first row\", \"NONE\", np.nan, [(0, i) for i, x in enumerate(headers)], [df.iloc[0, i] for i, x in enumerate(headers)]))\n", "        questions.append((3, \"what is the first entry\", \"NONE\", np.nan, [(0, i) for i, x in enumerate(headers)], [df.iloc[0, i] for i, x in enumerate(headers)]))\n", "\n", "        # LAST ROW\n", "        # The last data row is len(df)-1, both for the TAPAS coordinates and for df.iloc.\n", "        questions.append((4, \"what is the last row\", \"NONE\", np.nan, [(len(df)-1, i) for i, x in enumerate(headers)], [df.iloc[len(df)-1, i] for i, x in enumerate(headers)]))\n", "        questions.append((5, \"what is the last entry\", \"NONE\", np.nan, [(len(df)-1, i) for i, x in enumerate(headers)], [df.iloc[len(df)-1, i] for i, x in enumerate(headers)]))\n", "\n", "        # ASKING FOR 1 CELL WITH ANOTHER CELL AS REFERENCE\n", "        for numrow, t in enumerate(df.iterrows()):\n", "            for numcol, h in enumerate(headers):\n", "                other_headers = [x for x in headers if x != h]\n", "                for i, o in enumerate(other_headers):\n", "                    col = numcol\n", "                    q = t[1].loc[o]\n", "                    # The same value may appear in several rows\n", "                    rows = df[df[o]==q].index.tolist()\n", "                    questions.append((6, f\"what is the {h} when {o} is {q}\", \"NONE\", np.nan, [(r, col) for r in rows], [df.iloc[r,col] for r in rows] ))\n", "\n", "        questions.append(\n", "            (7, \"how big is the table\", \"NONE\", np.nan, [(len(df)-1, len(headers)-1)], [df.iloc[len(df)-1, len(headers)-1]]))\n", "        questions.append(\n", "            (8, \"what is the size of the table\", \"NONE\", np.nan, [(len(df)-1, len(headers)-1)], [df.iloc[len(df)-1, len(headers)-1]]))\n", "        questions.append(\n", "            (9, \"what is the last cell of the table\", \"NONE\", np.nan, [(len(df)-1, len(headers)-1)], [df.iloc[len(df)-1, len(headers)-1]]))\n", "        questions.append(\n", "            (10, \"what is the first cell of the table\", \"NONE\", np.nan, [(0,0)], [df.iloc[0, 0]]))\n", "\n", "        max_q = len(questions)\n", "        if max_questions_per_df is not None:\n", "            max_q = min(max_questions_per_df, max_q)\n", "\n", "        question_sampling = random.sample(questions, max_q)\n", "        question_sampling_df = pd.DataFrame(question_sampling, columns=['id', 'question', 'aggr', 'float_answer', 'cells', 'cell_texts'])\n", "\n", "        question_sampling_df['table'] = pickle.dumps(df.to_dict())\n", "\n", "        # Collect this dataframe's questions; the serialized table travels with each row\n", "        questions_df.append(question_sampling_df)\n", "\n", "    return questions_df" ] }, { "cell_type": "markdown", "metadata": { "id": "rjZOOQZbJLP4" },
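"source": [ "As a quick, optional sanity check, we can run `generate_NONE` on a tiny, made-up dataframe (the toy column names here are just for illustration):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "demo_df = pd.DataFrame({\"name\": [\"a\", \"b\"], \"city\": [\"x\", \"y\"]})\n", "demo_questions = generate_NONE(\"demo\", [demo_df], max_dfs_per_datasource=1, max_questions_per_df=3)\n", "demo_questions[0][['question', 'aggr', 'cells', 'cell_texts']]" ] }, { "cell_type": "markdown", "metadata": {},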
"source": [ "###✔️ Generate questions with COUNT aggregator and the cell(s)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "oGvbwThhx1bz" }, "outputs": [], "source": [ "import pickle, random\n", "\n", "def generate_COUNT(datasource, list_of_df, max_dfs_per_datasource=MAX_DFS_PER_DATASOURCE, max_questions_per_df=MAX_QUESTIONS_PER_DF):\n", "\n", "    # The question dataframe contains the following columns:\n", "    # - A dummy ID just to identify what type of question it is. It's ignored.\n", "    # - The question text\n", "    # - The aggregation string: NONE, COUNT, AVERAGE, SUM, MAX, MIN\n", "    # - The FLOAT result of the operation, not used unless you want to train a weak supervision model\n", "    # - A list of tuples, where each tuple is a cell answering the question\n", "    # - A list of texts, where each text is the answer to the question\n", "\n", "    questions_df = []\n", "    for counter, df in enumerate(list_of_df):\n", "        if max_dfs_per_datasource is not None and counter == max_dfs_per_datasource:\n", "            break\n", "        questions = []\n", "        headers = df.columns\n", "\n", "        # COUNT WITH ANOTHER CELL AS CONDITION\n", "        for numrow, t in enumerate(df.iterrows()):\n", "            for numcol, h in enumerate(headers):\n", "                other_headers = [x for x in headers if x != h]\n", "                for i, o in enumerate(other_headers):\n", "                    col = numcol\n", "                    q = t[1].loc[o]\n", "                    # The same value may appear in several rows\n", "                    rows = df[df[o]==q].index.tolist()\n", "                    count = float(len(rows))\n", "                    questions.append((5, f\"how many times {o} is {q}\", \"COUNT\", count, [(r, col) for r in rows], [df.iloc[r,col] for r in rows] ))\n", "                    questions.append((5, f\"count the number of times {o} is {q}\", \"COUNT\", count, [(r, col) for r in rows], [df.iloc[r,col] for r in rows] ))\n", "\n", "        # COUNT\n", "        questions.extend([\n", "            (0, f\"how many {x} are there\", \"COUNT\", float(len(df)), [(a, i) for a in range(0, len(df))], [df.iloc[a, i] for a in range(0, len(df))]) for i, x in enumerate(headers)])\n", "        questions.extend([\n", "            (1, f\"how many {x} do we have\", \"COUNT\", float(len(df)), [(a, i) for a in range(0, len(df))], [df.iloc[a, i] for a in range(0, len(df))]) for i, x in enumerate(headers)])\n", "        questions.extend([\n", "            (2, f\"how many {x} does the table have\", \"COUNT\", float(len(df)), [(a, i) for a in range(0, len(df))], [df.iloc[a, i] for a in range(0, len(df))]) for i, x in enumerate(headers)])\n", "        questions.extend([\n", "            (3, f\"how many {x} has the table\", \"COUNT\", float(len(df)), [(a, i) for a in range(0, len(df))], [df.iloc[a, i] for a in range(0, len(df))]) for i, x in enumerate(headers)])\n", "\n", "        questions.append(\n", "            (4, \"how many rows do we have\", \"COUNT\", float(len(df)), [(a, 0) for a in range(0, len(df))], [df.iloc[a, 0] for a in range(0, len(df))]))\n", "        questions.append(\n", "            (5, \"how many rows are there\", \"COUNT\", float(len(df)), [(a, 0) for a in range(0, len(df))], [df.iloc[a, 0] for a in range(0, len(df))]))\n", "        questions.append(\n", "            (6, \"how many rows does the table have\", \"COUNT\", float(len(df)), [(a, 0) for a in range(0, len(df))], [df.iloc[a, 0] for a in range(0, len(df))]))\n", "        questions.append(\n", "            (7, \"how many rows has the table\", \"COUNT\", float(len(df)), [(a, 0) for a in range(0, len(df))], [df.iloc[a, 0] for a in range(0, len(df))]))\n", "\n", "        max_q = len(questions)\n", "        if max_questions_per_df is not None:\n", "            max_q = min(max_questions_per_df, max_q)\n", "\n", "        question_sampling = random.sample(questions, max_q)\n", "        
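# turn the sampled tuples into a one-row-per-question dataframe\n", "        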
question_sampling_df = pd.DataFrame(question_sampling, columns=['id', 'question', 'aggr', 'float_answer', 'cells', 'cell_texts'])\n", "\n", "        question_sampling_df['table'] = pickle.dumps(df.to_dict())\n", "\n", "        # Collect this dataframe's questions; the serialized table travels with each row\n", "        questions_df.append(question_sampling_df)\n", "\n", "    return questions_df" ] }, { "cell_type": "markdown", "metadata": { "id": "17gOTPyxJRZi" }, "source": [ "###✔️ Generate questions with SUM aggregator and the cell(s)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "z9ONAbq6xaDD" }, "outputs": [], "source": [ "import pickle, random\n", "\n", "def generate_SUM(datasource, list_of_df, max_dfs_per_datasource=MAX_DFS_PER_DATASOURCE, max_questions_per_df=MAX_QUESTIONS_PER_DF):\n", "\n", "    # The question dataframe contains the following columns:\n", "    # - A dummy ID just to identify what type of question it is. It's ignored.\n", "    # - The question text\n", "    # - The aggregation string: NONE, COUNT, AVERAGE, SUM, MAX, MIN\n", "    # - The FLOAT result of the operation, not used unless you want to train a weak supervision model\n", "    # - A list of tuples, where each tuple is a cell answering the question\n", "    # - A list of texts, where each text is the answer to the question\n", "\n", "    questions_df = []\n", "    for counter, df in enumerate(list_of_df):\n", "        if max_dfs_per_datasource is not None and counter == max_dfs_per_datasource:\n", "            break\n", "        questions = []\n", "        headers = df.columns\n", "\n", "        # SUM\n", "        quantity = list(headers).index('quantity')\n", "        qty_result_list = [df.iloc[i, quantity] for i in range(0, len(df))]\n", "        qty_sum = sum([float(x.replace(',','')) for x in qty_result_list])\n", "        questions.append((0, \"what is the quantity total\", \"SUM\", qty_sum, [(i, quantity) for i in range(0, len(df))], qty_result_list ))\n", "        questions.append((1, \"what is the total of quantity\", \"SUM\", qty_sum, [(i, quantity) for i in range(0, len(df))], qty_result_list ))\n", "        questions.append((2, \"what is the total for quantity\", \"SUM\", qty_sum, [(i, quantity) for i in range(0, len(df))], qty_result_list ))\n", "        questions.append((3, \"what is the sum for quantity\", \"SUM\", qty_sum, [(i, quantity) for i in range(0, len(df))], qty_result_list ))\n", "\n", "        # SUM WITH CONDITION\n", "        for numrow, t in enumerate(df.iterrows()):\n", "            for i, h in enumerate(headers):\n", "                if h in ['quantity', 'percentage']:\n", "                    continue\n", "                numcol = i\n", "                q = t[1].loc[h]\n", "                rows = df[df[h]==q].index.tolist()\n", "\n", "                qty_result_list = [df.iloc[r, quantity] for r in rows]\n", "                qty_sum = sum([float(x.replace(',','')) for x in qty_result_list])\n", "\n", "                # Quantity\n", "                questions.append((4, f\"what is the overall quantity when {h} is {q}\", \"SUM\", qty_sum, [(r, quantity) for r in rows], qty_result_list ))\n", "                questions.append((5, f\"what is the total of quantity when {h} is {q}\", \"SUM\", qty_sum, [(r, quantity) for r in rows], qty_result_list ))\n", "                questions.append((6, f\"what is the sum of quantity when {h} is {q}\", \"SUM\", qty_sum, [(r, quantity) for r in rows], qty_result_list ))\n", "\n", "        max_q = len(questions)\n", "        if max_questions_per_df is not None:\n", "            max_q = min(max_questions_per_df, max_q)\n", "\n",
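 "        # keep at most max_q randomly chosen questions (sampled without replacement)\n",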
 "        question_sampling = random.sample(questions, max_q)\n", "        question_sampling_df = pd.DataFrame(question_sampling, columns=['id', 'question', 'aggr', 'float_answer', 'cells', 'cell_texts'])\n", "\n", "        question_sampling_df['table'] = pickle.dumps(df.to_dict())\n", "\n", "        # Collect this dataframe's questions; the serialized table travels with each row\n", "        questions_df.append(question_sampling_df)\n", "\n", "    return questions_df" ] }, { "cell_type": "markdown", "metadata": { "id": "-TFqLdidJUa7" }, "source": [ "###✔️ Generate questions with AVERAGE aggregator and the cell(s)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "p-gmvpKaykUY" }, "outputs": [], "source": [ "import pickle, random\n", "from statistics import mean\n", "\n", "def generate_AVERAGE(datasource, list_of_df, max_dfs_per_datasource=MAX_DFS_PER_DATASOURCE, max_questions_per_df=MAX_QUESTIONS_PER_DF):\n", "\n", "    # The question dataframe contains the following columns:\n", "    # - A dummy ID just to identify what type of question it is. It's ignored.\n", "    # - The question text\n", "    # - The aggregation string: NONE, COUNT, AVERAGE, SUM, MAX, MIN\n", "    # - The FLOAT result of the operation, not used unless you want to train a weak supervision model\n", "    # - A list of tuples, where each tuple is a cell answering the question\n", "    # - A list of texts, where each text is the answer to the question\n", "\n", "    questions_df = []\n", "    for counter, df in enumerate(list_of_df):\n", "        if max_dfs_per_datasource is not None and counter == max_dfs_per_datasource:\n", "            break\n", "        questions = []\n", "        headers = df.columns\n", "\n", "        percentage = list(headers).index('percentage')\n", "        per_result_list = [df.iloc[i, percentage] for i in range(0, len(df))]\n", "        per_avg = mean([float(x.replace('%','')) for x in per_result_list])\n", "        questions.append((0, \"what is the percentage average\", \"AVERAGE\", per_avg, [(i, percentage) for i in range(0, len(df))], per_result_list ))\n", "        questions.append((1, \"what is the mean percentage\", \"AVERAGE\", per_avg, [(i, percentage) for i in range(0, len(df))], per_result_list ))\n", "        questions.append((2, \"what is the average percentage\", \"AVERAGE\", per_avg, [(i, percentage) for i in range(0, len(df))], per_result_list ))\n", "\n", "        quantity = list(headers).index('quantity')\n", "        # AVERAGE WITH CONDITION\n", "        for numrow, t in enumerate(df.iterrows()):\n", "            for i, h in enumerate(headers):\n", "                if h in ['quantity', 'percentage']:\n", "                    continue\n", "                numcol = i\n", "                q = t[1].loc[h]\n", "                rows = df[df[h]==q].index.tolist()\n", "\n", "                qty_result_list = [df.iloc[r, quantity] for r in rows]\n", "                qty_avg = mean([float(x.replace(',','')) for x in qty_result_list])\n", "\n", "                # Quantity\n", "                questions.append((3, f\"what is the quantity average when {h} is {q}\", \"AVERAGE\", qty_avg, [(r, quantity) for r in rows], qty_result_list ))\n", "                questions.append((4, f\"what is the average quantity when {h} is {q}\", \"AVERAGE\", qty_avg, [(r, quantity) for r in rows], qty_result_list ))\n", "                questions.append((5, f\"what is the mean of quantity when {h} is {q}\", \"AVERAGE\", qty_avg, [(r, quantity) for r in rows], qty_result_list ))\n", "                questions.append((6, f\"what is the quantity mean when {h} is {q}\", \"AVERAGE\", qty_avg, [(r, quantity) for r in rows], 
qty_result_list ))\n", "\n", "                # Percentage\n", "                per_result_list = [df.iloc[r, percentage] for r in rows]\n", "                per_avg = mean([float(x.replace('%','')) for x in per_result_list])\n", "\n", "                questions.append((7, f\"what is the mean percentage when {h} is {q}\", \"AVERAGE\", per_avg, [(r, percentage) for r in rows], per_result_list ))\n", "                questions.append((8, f\"what is the average percentage when {h} is {q}\", \"AVERAGE\", per_avg, [(r, percentage) for r in rows], per_result_list ))\n", "\n", "        # MEAN / AVERAGE over the whole quantity column\n", "        qty_result_list = [df.iloc[i, quantity] for i in range(0, len(df))]\n", "        qty_avg = mean([float(x.replace(',','')) for x in qty_result_list])\n", "\n", "        questions.append((9, \"what is the quantity average\", \"AVERAGE\", qty_avg, [(i, quantity) for i in range(0, len(df))], qty_result_list ))\n", "        questions.append((10, \"what is the mean of quantity\", \"AVERAGE\", qty_avg, [(i, quantity) for i in range(0, len(df))], qty_result_list ))\n", "        questions.append((11, \"what is the average quantity\", \"AVERAGE\", qty_avg, [(i, quantity) for i in range(0, len(df))], qty_result_list ))\n", "        questions.append((12, \"what is the quantity mean\", \"AVERAGE\", qty_avg, [(i, quantity) for i in range(0, len(df))], qty_result_list ))\n", "\n", "        max_q = len(questions)\n", "        if max_questions_per_df is not None:\n", "            max_q = min(max_questions_per_df, max_q)\n", "\n", "        question_sampling = random.sample(questions, max_q)\n", "        question_sampling_df = pd.DataFrame(question_sampling, columns=['id', 'question', 'aggr', 'float_answer', 'cells', 'cell_texts'])\n", "\n", "        question_sampling_df['table'] = pickle.dumps(df.to_dict())\n", "\n", "        # Collect this dataframe's questions; the serialized table travels with each row\n", "        questions_df.append(question_sampling_df)\n", "\n", "    return questions_df" ] }, { "cell_type": "markdown", "metadata": { "id": "6Oil2CWhJWDZ" }, "source": [ "###✔️ Generate questions with MAX aggregator and the cell(s)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4f0H5KcCkCW8" }, "outputs": [], "source": [ "import pickle, random\n", "\n", "def generate_MAX(datasource, list_of_df, max_dfs_per_datasource=MAX_DFS_PER_DATASOURCE, max_questions_per_df=MAX_QUESTIONS_PER_DF):\n", "\n", "    # The question dataframe contains the following columns:\n", "    # - A dummy ID just to identify what type of question it is. 
It's ignored.\n", "    # - The question text\n", "    # - The aggregation string: NONE, COUNT, AVERAGE, SUM, MAX, MIN\n", "    # - The FLOAT result of the operation, not used unless you want to train a weak supervision model\n", "    # - A list of tuples, where each tuple is a cell answering the question\n", "    # - A list of texts, where each text is the answer to the question\n", "\n", "    questions_df = []\n", "    for counter, df in enumerate(list_of_df):\n", "        if max_dfs_per_datasource is not None and counter == max_dfs_per_datasource:\n", "            break\n", "        questions = []\n", "        headers = df.columns\n", "\n", "        # MAX\n", "        quantity = list(headers).index('quantity')\n", "        qty_result_list = [df.iloc[i, quantity] for i in range(0, len(df))]\n", "        qty_max = max([float(x.replace(',','')) for x in qty_result_list])\n", "        questions.append((1, \"what is the quantity max\", \"MAX\", qty_max, [(i, quantity) for i in range(0, len(df))], qty_result_list ))\n", "        questions.append((2, \"what is the max of quantity\", \"MAX\", qty_max, [(i, quantity) for i in range(0, len(df))], qty_result_list ))\n", "        questions.append((3, \"what is the max for quantity\", \"MAX\", qty_max, [(i, quantity) for i in range(0, len(df))], qty_result_list ))\n", "        questions.append((4, \"what is the maximum quantity\", \"MAX\", qty_max, [(i, quantity) for i in range(0, len(df))], qty_result_list ))\n", "        questions.append((5, \"what is the highest quantity\", \"MAX\", qty_max, [(i, quantity) for i in range(0, len(df))], qty_result_list ))\n", "\n", "        percentage = list(headers).index('percentage')\n", "        per_result_list = [df.iloc[i, percentage] for i in range(0, len(df))]\n", "        per_max = max([float(x.replace('%','')) for x in per_result_list])\n", "        questions.append((6, \"what is the percentage maximum\", \"MAX\", per_max, [(i, percentage) for i in range(0, len(df))], per_result_list ))\n", "        questions.append((7, \"what is the max percentage\", \"MAX\", per_max, [(i, percentage) for i in range(0, len(df))], per_result_list ))\n", "        questions.append((8, \"what is the maximum percentage\", \"MAX\", per_max, [(i, percentage) for i in range(0, len(df))], per_result_list ))\n", "        questions.append((9, \"what is the highest percentage\", \"MAX\", per_max, [(i, percentage) for i in range(0, len(df))], per_result_list ))\n", "\n", "        # MAX WITH CONDITION\n", "        for numrow, t in enumerate(df.iterrows()):\n", "            for i, h in enumerate(headers):\n", "                if h in ['quantity', 'percentage']:\n", "                    continue\n", "                numcol = i\n", "                q = t[1].loc[h]\n", "                rows = df[df[h]==q].index.tolist()\n", "\n", "                qty_result_list = [df.iloc[r, quantity] for r in rows]\n", "                qty_max = max([float(x.replace(',','')) for x in qty_result_list])\n", "\n", "                # Quantity\n", "                questions.append((10, f\"what is the maximum quantity when {h} is {q}\", \"MAX\", qty_max, [(r, quantity) for r in rows], qty_result_list ))\n", "                questions.append((11, f\"what is the max of quantity when {h} is {q}\", \"MAX\", qty_max, [(r, quantity) for r in rows], qty_result_list ))\n", "                questions.append((12, f\"what is the quantity max when {h} is {q}\", \"MAX\", qty_max, [(r, quantity) for r in rows], qty_result_list ))\n", "                questions.append((13, f\"what is the highest quantity when {h} is {q}\", \"MAX\", qty_max, [(r, quantity) for r in rows], qty_result_list ))\n", "\n", "                # Percentage\n", "                per_result_list = [df.iloc[r, percentage] for r in rows]\n", "                per_max = max([float(x.replace('%','')) for x in per_result_list])\n", "\n", "                questions.append((14, f\"what is the 
max percentage when {h} is {q}\", \"MAX\", per_max, [(r, percentage) for r in rows], per_result_list ))\n", "                questions.append((15, f\"what is the maximum percentage when {h} is {q}\", \"MAX\", per_max, [(r, percentage) for r in rows], per_result_list ))\n", "                questions.append((16, f\"what is the highest percentage when {h} is {q}\", \"MAX\", per_max, [(r, percentage) for r in rows], per_result_list ))\n", "\n", "        max_q = len(questions)\n", "        if max_questions_per_df is not None:\n", "            max_q = min(max_questions_per_df, max_q)\n", "\n", "        question_sampling = random.sample(questions, max_q)\n", "        question_sampling_df = pd.DataFrame(question_sampling, columns=['id', 'question', 'aggr', 'float_answer', 'cells', 'cell_texts'])\n", "\n", "        question_sampling_df['table'] = pickle.dumps(df.to_dict())\n", "\n", "        # Collect this dataframe's questions; the serialized table travels with each row\n", "        questions_df.append(question_sampling_df)\n", "\n", "    return questions_df" ] }, { "cell_type": "markdown", "metadata": { "id": "ItnpGj-7JXbS" }, "source": [ "###✔️ Generate questions with MIN aggregator and the cell(s)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "aKp9K0WXkpkL" }, "outputs": [], "source": [ "import pickle, random\n", "\n", "def generate_MIN(datasource, list_of_df, max_dfs_per_datasource=MAX_DFS_PER_DATASOURCE, max_questions_per_df=MAX_QUESTIONS_PER_DF):\n", "\n", "    # The question dataframe contains the following columns:\n", "    # - A dummy ID just to identify what type of question it is. It's ignored.\n", "    # - The question text\n", "    # - The aggregation string: NONE, COUNT, AVERAGE, SUM, MAX, MIN\n", "    # - The FLOAT result of the operation, not used unless you want to train a weak supervision model\n", "    # - A list of tuples, where each tuple is a cell answering the question\n", "    # - A list of texts, where each text is the answer to the question\n", "\n", "    questions_df = []\n", "    for counter, df in enumerate(list_of_df):\n", "        if max_dfs_per_datasource is not None and counter == max_dfs_per_datasource:\n", "            break\n", "        questions = []\n", "        headers = df.columns\n", "\n", "        # MIN\n", "        quantity = list(headers).index('quantity')\n", "        qty_result_list = [df.iloc[i, quantity] for i in range(0, len(df))]\n", "        qty_min = min([float(x.replace(',','')) for x in qty_result_list])\n", "        questions.append((1, \"what is the quantity min\", \"MIN\", qty_min, [(i, quantity) for i in range(0, len(df))], qty_result_list ))\n", "        questions.append((2, \"what is the min of quantity\", \"MIN\", qty_min, [(i, quantity) for i in range(0, len(df))], qty_result_list ))\n", "        questions.append((3, \"what is the min for quantity\", \"MIN\", qty_min, [(i, quantity) for i in range(0, len(df))], qty_result_list ))\n", "        questions.append((4, \"what is the minimum quantity\", \"MIN\", qty_min, [(i, quantity) for i in range(0, len(df))], qty_result_list ))\n", "        questions.append((5, \"what is the lowest quantity\", \"MIN\", qty_min, [(i, quantity) for i in range(0, len(df))], qty_result_list ))\n", "\n", "        percentage = list(headers).index('percentage')\n", "        per_result_list = [df.iloc[i, percentage] for i in range(0, len(df))]\n", "        per_min = min([float(x.replace('%','')) for x in per_result_list])\n", "        questions.append((6, \"what is the percentage min\", \"MIN\", per_min, [(i, percentage) for i in range(0, len(df))], per_result_list ))\n", "        questions.append((7, \"what is the min percentage\", \"MIN\", per_min, [(i, percentage) for i in range(0, len(df))], per_result_list 
)) \n", "        questions.append((8, \"what is the minimum percentage\", \"MIN\", per_min, [(i, percentage) for i in range(0, len(df))], per_result_list ))\n", "        questions.append((9, \"what is the lowest percentage\", \"MIN\", per_min, [(i, percentage) for i in range(0, len(df))], per_result_list ))\n", "\n", "        # MIN WITH CONDITION\n", "        for numrow, t in enumerate(df.iterrows()):\n", "            for i, h in enumerate(headers):\n", "                if h in ['quantity', 'percentage']:\n", "                    continue\n", "                numcol = i\n", "                q = t[1].loc[h]\n", "                rows = df[df[h]==q].index.tolist()\n", "\n", "                qty_result_list = [df.iloc[r, quantity] for r in rows]\n", "                qty_min = min([float(x.replace(',','')) for x in qty_result_list])\n", "\n", "                # Quantity\n", "                questions.append((10, f\"what is the minimum quantity when {h} is {q}\", \"MIN\", qty_min, [(r, quantity) for r in rows], qty_result_list ))\n", "                questions.append((11, f\"what is the min of quantity when {h} is {q}\", \"MIN\", qty_min, [(r, quantity) for r in rows], qty_result_list ))\n", "                questions.append((12, f\"what is the quantity min when {h} is {q}\", \"MIN\", qty_min, [(r, quantity) for r in rows], qty_result_list ))\n", "                questions.append((13, f\"what is the lowest quantity when {h} is {q}\", \"MIN\", qty_min, [(r, quantity) for r in rows], qty_result_list ))\n", "\n", "                # Percentage\n", "                per_result_list = [df.iloc[r, percentage] for r in rows]\n", "                per_min = min([float(x.replace('%','')) for x in per_result_list])\n", "\n", "                questions.append((14, f\"what is the min percentage when {h} is {q}\", \"MIN\", per_min, [(r, percentage) for r in rows], per_result_list ))\n", "                questions.append((15, f\"what is the minimum percentage when {h} is {q}\", \"MIN\", per_min, [(r, percentage) for r in rows], per_result_list ))\n", "                questions.append((16, f\"what is the lowest percentage when {h} is {q}\", \"MIN\", per_min, [(r, percentage) for r in rows], per_result_list ))\n", "\n", "        max_q = len(questions)\n", "        if max_questions_per_df is not None:\n", "            max_q = min(max_questions_per_df, max_q)\n", "\n", "        question_sampling = random.sample(questions, max_q)\n", "        question_sampling_df = pd.DataFrame(question_sampling, columns=['id', 'question', 'aggr', 'float_answer', 'cells', 'cell_texts'])\n", "\n", "        question_sampling_df['table'] = pickle.dumps(df.to_dict())\n", "\n", "        # Collect this dataframe's questions; the serialized table travels with each row\n", "        questions_df.append(question_sampling_df)\n", "\n", "    return questions_df" ] }, { "cell_type": "markdown", "metadata": { "id": "AM07J_8yS73A" }, "source": [ "#📚 Data Loader" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "45oJ_yVnIw8Y" }, "outputs": [], "source": [ "import torch\n", "\n", "class TableDataset(torch.utils.data.Dataset):\n", "    def __init__(self, df, tokenizer):\n", "        self.df = df\n", "        self.tokenizer = tokenizer\n", "\n", "    def __getitem__(self, idx):\n", "        item = self.df.iloc[idx]\n", "        table = pickle.loads(item.table)\n", "        df_table = pd.DataFrame(table)\n", "        # encode the table together with this single question, its answer cells and answer texts\n", "        encoding = self.tokenizer(table=df_table,\n", "                                  queries=item.question,\n", "                                  answer_coordinates=item.cells,\n", "                                  answer_text=item.cell_texts,\n", "                                  padding=\"max_length\",\n", "                                  truncation=True,\n", "                                  return_tensors=\"pt\"\n", "        )\n", "        aggr = lab_dict[item.aggr]\n", "\n", "        # remove the batch dimension which the tokenizer adds\n", "        encoding = {key: val.squeeze(0) for key, val in encoding.items()}\n", "        # Aggregation 
operation ID\n", "        encoding[\"aggr\"] = torch.tensor([aggr], dtype=torch.int64)\n", "\n", "        # float_answer is for weak supervision, i.e. when only the final value of the aggregation is given.\n", "        # We leave this here in case you want to experiment, but keep in mind that in the original TAPAS model\n", "        # only SUM, AVERAGE and COUNT can be used for weak supervision.\n", "\n", "        # encoding[\"float_answer\"] = torch.tensor([item.float_answer], dtype=torch.float32)\n", "\n", "        return encoding\n", "\n", "    def __len__(self):\n", "        return len(self.df)" ] }, { "cell_type": "markdown", "metadata": { "id": "oNG1j8G6TBYP" }, "source": [ "#🚀 Training loop" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "fsa2k16lfR8Z" }, "outputs": [], "source": [ "import torch\n", "from transformers import AdamW\n", "\n", "# GPU training\n", "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", "model.to(device)\n", "\n", "# Adam optimizer\n", "optimizer = AdamW(model.parameters(), lr=LR)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 633 }, "id": "TW2vbMx5sA34", "outputId": "0010e0be-f854-4712-d925-2e4705c238c4" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "- Generating questions...\n", "- Epoch: 0\n", "-- Questions Dataframe Num.: 11/12\n", "--- Epoch Loss: 1.5686425358767153\n", "--- Errors: 0\n", "- Epoch: 1\n", "-- Questions Dataframe Num.: 11/12\n", "--- Epoch Loss: 1.1960902499281898\n", "--- Errors: 0\n", "- Epoch: 2\n", "-- Questions Dataframe Num.: 11/12\n", "--- Epoch Loss: 0.9914937746003347\n", "--- Errors: 0\n", "- Epoch: 3\n", "-- Questions Dataframe Num.: 11/12\n", "--- Epoch Loss: 1.031176200393808\n", "--- Errors: 0\n", "- Epoch: 4\n", "-- Questions Dataframe Num.: 11/12\n", "--- Epoch Loss: 1.0282052185060915\n", "--- Errors: 0\n", "- Epoch: 5\n", "-- Questions Dataframe Num.: 11/12\n", "--- Epoch Loss: 0.8980713966506608\n", "--- Errors: 0\n", "- Epoch: 6\n", "-- Questions Dataframe Num.: 11/12\n", "--- Epoch Loss: 1.4654125168596366\n", "--- Errors: 0\n", "- Epoch: 7\n", "-- Questions Dataframe Num.: 11/12\n", "--- Epoch Loss: 1.1752010677899318\n", "--- Errors: 0\n" ] }, { "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'/content/tapas_example_by_JSL_model_.zip'" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "from statistics import mean\n", "import sys\n", "import shutil\n", "import bz2\n", "import pickle\n", "import random\n", "\n", "# The question dataframe contains the following columns:\n", "# - A dummy ID just to identify what type of question it is. It's ignored.\n", "# - The question text\n", "# - The aggregation string: NONE, COUNT, AVERAGE, SUM, MAX, MIN\n", "# - If the aggregation is not NONE, the FLOAT result of the operation.
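 Unused here: under strong supervision the model is trained on the aggregation label instead of the float answer.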
\n", "# - A list of tuples, where each tuple is a cell answering the question\n", "# - A list of texts, where each text is the answer to the question\n", "print(\"- Generating questions...\")\n", "questions_df = []\n", "questions_df.extend(generate_NONE(datasource,\n", "                                  list_of_df,\n", "                                  max_dfs_per_datasource=MAX_DFS_PER_DATASOURCE,\n", "                                  max_questions_per_df=MAX_QUESTIONS_PER_DF))\n", "\n", "questions_df.extend(generate_SUM(datasource,\n", "                                 list_of_df,\n", "                                 max_dfs_per_datasource=MAX_DFS_PER_DATASOURCE,\n", "                                 max_questions_per_df=MAX_QUESTIONS_PER_DF))\n", "\n", "questions_df.extend(generate_AVERAGE(datasource,\n", "                                     list_of_df,\n", "                                     max_dfs_per_datasource=MAX_DFS_PER_DATASOURCE,\n", "                                     max_questions_per_df=MAX_QUESTIONS_PER_DF))\n", "\n", "questions_df.extend(generate_COUNT(datasource,\n", "                                   list_of_df,\n", "                                   max_dfs_per_datasource=MAX_DFS_PER_DATASOURCE,\n", "                                   max_questions_per_df=MAX_QUESTIONS_PER_DF))\n", "\n", "questions_df.extend(generate_MAX(datasource,\n", "                                 list_of_df,\n", "                                 max_dfs_per_datasource=MAX_DFS_PER_DATASOURCE,\n", "                                 max_questions_per_df=MAX_QUESTIONS_PER_DF))\n", "\n", "questions_df.extend(generate_MIN(datasource,\n", "                                 list_of_df,\n", "                                 max_dfs_per_datasource=MAX_DFS_PER_DATASOURCE,\n", "                                 max_questions_per_df=MAX_QUESTIONS_PER_DF))\n", "\n", "total_df = len(questions_df)\n", "\n", "# We will aggregate all the losses for all the dataframes and show the\n", "# AVERAGE LOSS as the EPOCH loss\n", "epoch_losses = []\n", "\n", "for epoch in range(EPOCHS):  # loop over the dataset multiple times\n", "\n", "    # Sometimes you may get errors in pandas resolving the questions.\n", "    # This counter is a placeholder: it stays at 0 unless you wrap the batch loop in try/except.\n", "    errors = 0\n", "\n", "    print(\"- Epoch:\", epoch)\n", "\n", "    df_losses = []\n", "    for i, question_df in enumerate(questions_df):\n", "        print(f\"\\r-- Questions Dataframe Num.: {i}/{total_df}\", end=\"\")\n", "\n", "        train_dataset = TableDataset(df=question_df, tokenizer=tokenizer)\n", "        train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE)\n", "        losses = []\n", "        for idx, batch in enumerate(train_dataloader):\n", "            # get the inputs\n", "            input_ids = batch[\"input_ids\"].to(device)\n", "            attention_mask = batch[\"attention_mask\"].to(device)\n", "            token_type_ids = batch[\"token_type_ids\"].to(device)\n", "            labels = batch[\"labels\"].to(device)\n", "            aggr_labels = batch[\"aggr\"].to(device)\n", "            # Only used for weak supervision, not for strong supervision\n", "            # float_answer = batch[\"float_answer\"].to(device)\n", "            numeric_values = batch[\"numeric_values\"].to(device)\n", "            numeric_values_scale = batch[\"numeric_values_scale\"].to(device)\n", "\n", "            # zero the parameter gradients\n", "            optimizer.zero_grad()\n", "            # forward + backward + optimize\n", "            outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids,\n", "                            labels=labels, aggregation_labels = aggr_labels, numeric_values=numeric_values,\n", "                            numeric_values_scale=numeric_values_scale) # float_answer=float_answer, only provide it for weak supervision\n", "            loss = outputs.loss\n", "            losses.append(loss.item())\n", "            loss.backward()\n", "            optimizer.step()\n", "        df_losses.append(mean(losses))\n", "\n", "    epoch_mean_loss = mean(df_losses)\n", "    epoch_losses.append(epoch_mean_loss)\n", "    print(f\"\\n--- Epoch Loss: {epoch_mean_loss}\")\n", "    print(f\"--- Errors: {errors}\")\n", "\n", "# Saving model (this creates the folder too)\n", "mod = f'{datasource}/model/'\n", "modr = 
mod.replace(\"/\",\"_\")\n", "model.save_pretrained(save_directory=mod,\n", "                      is_main_process=True, state_dict=model.state_dict())\n", "\n", "shutil.make_archive(modr, # File name\n", "                    'zip',\n", "                    './',\n", "                    mod)" ] }, { "cell_type": "markdown", "metadata": { "id": "uSABUSaf5J_V" }, "source": [ "#🚀 Inference" ] }, { "cell_type": "markdown", "metadata": { "id": "ZktsJ0sUboTw" }, "source": [ "Let's test our model." ] }, { "cell_type": "markdown", "metadata": { "id": "2bkn8sx1bp1A" }, "source": [ "##🎌 Inference and Visualization functions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "QRFoJYQJEB7x" }, "outputs": [], "source": [ "# process results using the HF post-processing routine\n", "# NOTE: `inputs` is the tokenizer encoding of (table, queries) created in the inference cells below\n", "def process_results(model, logits, logits_aggregation):\n", "    id2aggregation = model.config.aggregation_labels\n", "\n", "    if logits_aggregation is not None:\n", "        predicted_answer_coordinates, predicted_aggregation_indices = tokenizer.convert_logits_to_predictions(\n", "            inputs, logits, logits_aggregation\n", "        )\n", "        print(predicted_aggregation_indices)\n", "        aggregation_predictions_string = [id2aggregation[x] for x in predicted_aggregation_indices]\n", "    else:\n", "        predicted_answer_coordinates = tokenizer.convert_logits_to_predictions(\n", "            inputs, logits\n", "        )[0]\n", "        aggregation_predictions_string = [id2aggregation[0] for x in range(logits.shape[0])]\n", "\n", "    return predicted_answer_coordinates, aggregation_predictions_string\n", "\n", "# show results\n", "def show_results(model, logits, logits_aggregation, queries):\n", "    predicted_answer_coordinates, aggregation_predictions_string = process_results(model, logits, logits_aggregation)\n", "    answers = []\n", "    for coordinates in predicted_answer_coordinates:\n", "        if len(coordinates) == 1:\n", "            # only a single cell:\n", "            answers.append(table.iat[coordinates[0]])\n", "        else:\n", "            # multiple cells\n", "            cell_values = []\n", "            for coordinate in coordinates:\n", "                cell_values.append(table.iat[coordinate])\n", "            answers.append(\", \".join(cell_values))\n", "\n", "    display(table)\n", "    print(\"\")\n", "    for query, answer, predicted_agg in zip(queries, answers, aggregation_predictions_string):\n", "        print(query)\n", "        if predicted_agg == \"NONE\":\n", "            print(\"Predicted answer: \" + answer)\n", "        else:\n", "            print(\"Predicted answer: \" + predicted_agg + \" > \" + answer)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "SOZNxu1T2W0G" }, "outputs": [], "source": [ "import collections\n", "import numpy as np\n", "\n", "def compute_prediction_sequence(model, data, device):\n", "    \"\"\"Computes predictions using model's answers to the previous questions.\"\"\"\n", "\n", "    # prepare data\n", "    input_ids = data[\"input_ids\"].to(device)\n", "    attention_mask = data[\"attention_mask\"].to(device)\n", "    token_type_ids = data[\"token_type_ids\"].to(device)\n", "\n", "    all_logits = []\n", "    prev_answers = None\n", "\n", "    num_batch = data[\"input_ids\"].shape[0]\n", "\n", "    for idx in range(num_batch):\n", "\n", "        if prev_answers is not None:\n", "            coords_to_answer = prev_answers[idx]\n", "            # Next, set the label ids predicted by the model\n", "            # (token_type_ids_example still holds the previous question's encoding here; it is only used for its shape)\n", "            prev_label_ids_example = token_type_ids_example[:,3] # shape (seq_len,)\n", "            model_label_ids = np.zeros_like(prev_label_ids_example.cpu().numpy()) # shape (seq_len,)\n", "\n", "            # for each token in the sequence:\n", "            token_type_ids_example = token_type_ids[idx] # shape (seq_len, 7)\n", "            for i in range(model_label_ids.shape[0]):\n", "                segment_id = 
token_type_ids_example[:,0].tolist()[i]\n", "                col_id = token_type_ids_example[:,1].tolist()[i] - 1\n", "                row_id = token_type_ids_example[:,2].tolist()[i] - 1\n", "                if row_id >= 0 and col_id >= 0 and segment_id == 1:\n", "                    model_label_ids[i] = int(coords_to_answer[(col_id, row_id)])\n", "\n", "            # set the prev label ids of the example (shape (1, seq_len) )\n", "            token_type_ids_example[:,3] = torch.from_numpy(model_label_ids).type(torch.long).to(device)\n", "\n", "        prev_answers = {}\n", "        # get the example\n", "        input_ids_example = input_ids[idx] # shape (seq_len,)\n", "        attention_mask_example = attention_mask[idx] # shape (seq_len,)\n", "        token_type_ids_example = token_type_ids[idx] # shape (seq_len, 7)\n", "        # forward pass to obtain the logits\n", "        outputs = model(input_ids=input_ids_example.unsqueeze(0),\n", "                        attention_mask=attention_mask_example.unsqueeze(0),\n", "                        token_type_ids=token_type_ids_example.unsqueeze(0))\n", "        logits = outputs.logits\n", "        all_logits.append(logits)\n", "\n", "        # convert logits to probabilities (which are of shape (1, seq_len))\n", "        dist_per_token = torch.distributions.Bernoulli(logits=logits)\n", "        probabilities = dist_per_token.probs * attention_mask_example.type(torch.float32).to(dist_per_token.probs.device)\n", "\n", "        # Compute average probability per cell, aggregating over tokens.\n", "        # Dictionary maps coordinates to a list of one or more probabilities\n", "        coords_to_probs = collections.defaultdict(list)\n", "        prev_answers = {}\n", "        for i, p in enumerate(probabilities.squeeze().tolist()):\n", "            segment_id = token_type_ids_example[:,0].tolist()[i]\n", "            col = token_type_ids_example[:,1].tolist()[i] - 1\n", "            row = token_type_ids_example[:,2].tolist()[i] - 1\n", "            if col >= 0 and row >= 0 and segment_id == 1:\n", "                coords_to_probs[(col, row)].append(p)\n", "\n", "        # Next, map cell coordinates to 1 or 0 (depending on whether the mean prob of all cell tokens is > 0.5)\n", "        coords_to_answer = {}\n", "        for key in coords_to_probs:\n", "            coords_to_answer[key] = np.array(coords_to_probs[key]).mean() > 0.5\n", "        prev_answers[idx+1] = coords_to_answer\n", "\n", "    logits_batch = torch.cat(tuple(all_logits), 0)\n", "\n", "    return logits_batch" ] }, { "cell_type": "markdown", "metadata": { "id": "NWxAxbAybuUI" }, "source": [ "#✅ Example dataframe\n", "This data is fake and is shown purely for illustration" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "id0LtVjD5MYs" }, "outputs": [], "source": [ "data = {'Company': [\"JACKSON, INC.\", \"RETRO COMP, CORP.\", \"DEEPAI, Inc.\"],\n", "        'Revenue': [\"5600000\", \"45000000\", \"59000000\"],\n", "        'Share Percentage': [\"1%\", \"2%\", \"3%\"],\n", "        'Founding Date': [\"7 february 1967\", \"10 june 1996\", \"28 november 1967\"]}\n", "\n", "queries = [\"Which is the Company where Founding Date is 7 february 1967\",\n", "           \"What is the max Revenue\",\n", "           \"What is the lowest Share Percentage\",\n", "           \"What are the Founding Dates\",\n", "           \"What is the min Share Percentage\",\n", "           \"How many companies are there\"]\n", "\n", "table = pd.DataFrame.from_dict(data)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0D7HShzTi8-H", "outputId": "ddf7c40a-b459-4de6-92f0-050b480cf1b0" }, "outputs": [ { "data": { "text/plain": [ "TapasForQuestionAnswering(\n", "  (tapas): TapasModel(\n", "    (embeddings): TapasEmbeddings(\n", "      (word_embeddings): Embedding(30522, 768, 
padding_idx=0)\n", " (position_embeddings): Embedding(1024, 768)\n", " (token_type_embeddings_0): Embedding(3, 768)\n", " (token_type_embeddings_1): Embedding(256, 768)\n", " (token_type_embeddings_2): Embedding(256, 768)\n", " (token_type_embeddings_3): Embedding(2, 768)\n", " (token_type_embeddings_4): Embedding(256, 768)\n", " (token_type_embeddings_5): Embedding(256, 768)\n", " (token_type_embeddings_6): Embedding(10, 768)\n", " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " (encoder): TapasEncoder(\n", " (layer): ModuleList(\n", " (0): TapasLayer(\n", " (attention): TapasAttention(\n", " (self): TapasSelfAttention(\n", " (query): Linear(in_features=768, out_features=768, bias=True)\n", " (key): Linear(in_features=768, out_features=768, bias=True)\n", " (value): Linear(in_features=768, out_features=768, bias=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " (output): TapasSelfOutput(\n", " (dense): Linear(in_features=768, out_features=768, bias=True)\n", " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " )\n", " (intermediate): TapasIntermediate(\n", " (dense): Linear(in_features=768, out_features=3072, bias=True)\n", " (intermediate_act_fn): GELUActivation()\n", " )\n", " (output): TapasOutput(\n", " (dense): Linear(in_features=3072, out_features=768, bias=True)\n", " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " )\n", " (1): TapasLayer(\n", " (attention): TapasAttention(\n", " (self): TapasSelfAttention(\n", " (query): Linear(in_features=768, out_features=768, bias=True)\n", " (key): Linear(in_features=768, out_features=768, bias=True)\n", " (value): Linear(in_features=768, out_features=768, bias=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " (output): TapasSelfOutput(\n", " (dense): Linear(in_features=768, out_features=768, bias=True)\n", " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " )\n", " (intermediate): TapasIntermediate(\n", " (dense): Linear(in_features=768, out_features=3072, bias=True)\n", " (intermediate_act_fn): GELUActivation()\n", " )\n", " (output): TapasOutput(\n", " (dense): Linear(in_features=3072, out_features=768, bias=True)\n", " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " )\n", " (2): TapasLayer(\n", " (attention): TapasAttention(\n", " (self): TapasSelfAttention(\n", " (query): Linear(in_features=768, out_features=768, bias=True)\n", " (key): Linear(in_features=768, out_features=768, bias=True)\n", " (value): Linear(in_features=768, out_features=768, bias=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " (output): TapasSelfOutput(\n", " (dense): Linear(in_features=768, out_features=768, bias=True)\n", " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " )\n", " (intermediate): TapasIntermediate(\n", " (dense): Linear(in_features=768, out_features=3072, bias=True)\n", " (intermediate_act_fn): GELUActivation()\n", " )\n", " (output): TapasOutput(\n", " (dense): Linear(in_features=3072, out_features=768, bias=True)\n", " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", " (dropout): Dropout(p=0.1, 
inplace=False)\n", " )\n", " )\n", " (3): TapasLayer(\n", " (attention): TapasAttention(\n", " (self): TapasSelfAttention(\n", " (query): Linear(in_features=768, out_features=768, bias=True)\n", " (key): Linear(in_features=768, out_features=768, bias=True)\n", " (value): Linear(in_features=768, out_features=768, bias=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " (output): TapasSelfOutput(\n", " (dense): Linear(in_features=768, out_features=768, bias=True)\n", " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " )\n", " (intermediate): TapasIntermediate(\n", " (dense): Linear(in_features=768, out_features=3072, bias=True)\n", " (intermediate_act_fn): GELUActivation()\n", " )\n", " (output): TapasOutput(\n", " (dense): Linear(in_features=3072, out_features=768, bias=True)\n", " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " )\n", " (4): TapasLayer(\n", " (attention): TapasAttention(\n", " (self): TapasSelfAttention(\n", " (query): Linear(in_features=768, out_features=768, bias=True)\n", " (key): Linear(in_features=768, out_features=768, bias=True)\n", " (value): Linear(in_features=768, out_features=768, bias=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " (output): TapasSelfOutput(\n", " (dense): Linear(in_features=768, out_features=768, bias=True)\n", " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " )\n", " (intermediate): TapasIntermediate(\n", " (dense): Linear(in_features=768, out_features=3072, bias=True)\n", " (intermediate_act_fn): GELUActivation()\n", " )\n", " (output): TapasOutput(\n", " (dense): Linear(in_features=3072, out_features=768, bias=True)\n", " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " )\n", " (5): TapasLayer(\n", " (attention): TapasAttention(\n", " (self): TapasSelfAttention(\n", " (query): Linear(in_features=768, out_features=768, bias=True)\n", " (key): Linear(in_features=768, out_features=768, bias=True)\n", " (value): Linear(in_features=768, out_features=768, bias=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " (output): TapasSelfOutput(\n", " (dense): Linear(in_features=768, out_features=768, bias=True)\n", " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " )\n", " (intermediate): TapasIntermediate(\n", " (dense): Linear(in_features=768, out_features=3072, bias=True)\n", " (intermediate_act_fn): GELUActivation()\n", " )\n", " (output): TapasOutput(\n", " (dense): Linear(in_features=3072, out_features=768, bias=True)\n", " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " )\n", " (6): TapasLayer(\n", " (attention): TapasAttention(\n", " (self): TapasSelfAttention(\n", " (query): Linear(in_features=768, out_features=768, bias=True)\n", " (key): Linear(in_features=768, out_features=768, bias=True)\n", " (value): Linear(in_features=768, out_features=768, bias=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " (output): TapasSelfOutput(\n", " (dense): Linear(in_features=768, out_features=768, bias=True)\n", " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", " (dropout): Dropout(p=0.1, 
inplace=False)\n", " )\n", " )\n", " (intermediate): TapasIntermediate(\n", " (dense): Linear(in_features=768, out_features=3072, bias=True)\n", " (intermediate_act_fn): GELUActivation()\n", " )\n", " (output): TapasOutput(\n", " (dense): Linear(in_features=3072, out_features=768, bias=True)\n", " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " )\n", " (7): TapasLayer(\n", " (attention): TapasAttention(\n", " (self): TapasSelfAttention(\n", " (query): Linear(in_features=768, out_features=768, bias=True)\n", " (key): Linear(in_features=768, out_features=768, bias=True)\n", " (value): Linear(in_features=768, out_features=768, bias=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " (output): TapasSelfOutput(\n", " (dense): Linear(in_features=768, out_features=768, bias=True)\n", " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " )\n", " (intermediate): TapasIntermediate(\n", " (dense): Linear(in_features=768, out_features=3072, bias=True)\n", " (intermediate_act_fn): GELUActivation()\n", " )\n", " (output): TapasOutput(\n", " (dense): Linear(in_features=3072, out_features=768, bias=True)\n", " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " )\n", " (8): TapasLayer(\n", " (attention): TapasAttention(\n", " (self): TapasSelfAttention(\n", " (query): Linear(in_features=768, out_features=768, bias=True)\n", " (key): Linear(in_features=768, out_features=768, bias=True)\n", " (value): Linear(in_features=768, out_features=768, bias=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " (output): TapasSelfOutput(\n", " (dense): Linear(in_features=768, out_features=768, bias=True)\n", " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " )\n", " (intermediate): TapasIntermediate(\n", " (dense): Linear(in_features=768, out_features=3072, bias=True)\n", " (intermediate_act_fn): GELUActivation()\n", " )\n", " (output): TapasOutput(\n", " (dense): Linear(in_features=3072, out_features=768, bias=True)\n", " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " )\n", " (9): TapasLayer(\n", " (attention): TapasAttention(\n", " (self): TapasSelfAttention(\n", " (query): Linear(in_features=768, out_features=768, bias=True)\n", " (key): Linear(in_features=768, out_features=768, bias=True)\n", " (value): Linear(in_features=768, out_features=768, bias=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " (output): TapasSelfOutput(\n", " (dense): Linear(in_features=768, out_features=768, bias=True)\n", " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " )\n", " (intermediate): TapasIntermediate(\n", " (dense): Linear(in_features=768, out_features=3072, bias=True)\n", " (intermediate_act_fn): GELUActivation()\n", " )\n", " (output): TapasOutput(\n", " (dense): Linear(in_features=3072, out_features=768, bias=True)\n", " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " )\n", " (10): TapasLayer(\n", " (attention): TapasAttention(\n", " (self): TapasSelfAttention(\n", " (query): Linear(in_features=768, out_features=768, bias=True)\n", " (key): 
Linear(in_features=768, out_features=768, bias=True)\n", " (value): Linear(in_features=768, out_features=768, bias=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " (output): TapasSelfOutput(\n", " (dense): Linear(in_features=768, out_features=768, bias=True)\n", " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " )\n", " (intermediate): TapasIntermediate(\n", " (dense): Linear(in_features=768, out_features=3072, bias=True)\n", " (intermediate_act_fn): GELUActivation()\n", " )\n", " (output): TapasOutput(\n", " (dense): Linear(in_features=3072, out_features=768, bias=True)\n", " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " )\n", " (11): TapasLayer(\n", " (attention): TapasAttention(\n", " (self): TapasSelfAttention(\n", " (query): Linear(in_features=768, out_features=768, bias=True)\n", " (key): Linear(in_features=768, out_features=768, bias=True)\n", " (value): Linear(in_features=768, out_features=768, bias=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " (output): TapasSelfOutput(\n", " (dense): Linear(in_features=768, out_features=768, bias=True)\n", " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " )\n", " (intermediate): TapasIntermediate(\n", " (dense): Linear(in_features=768, out_features=3072, bias=True)\n", " (intermediate_act_fn): GELUActivation()\n", " )\n", " (output): TapasOutput(\n", " (dense): Linear(in_features=3072, out_features=768, bias=True)\n", " (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " )\n", " )\n", " )\n", " )\n", " (pooler): TapasPooler(\n", " (dense): Linear(in_features=768, out_features=768, bias=True)\n", " (activation): Tanh()\n", " )\n", " )\n", " (dropout): Dropout(p=0.1, inplace=False)\n", " (aggregation_classifier): Linear(in_features=768, out_features=6, bias=True)\n", ")" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import torch\n", "# GPU training\n", "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", "model.to(device)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "pjKeNADhFkhf" }, "outputs": [], "source": [ "inputs = tokenizer(table=table, queries=queries, padding='max_length', return_tensors=\"pt\")\n", "logits = compute_prediction_sequence(model, inputs, device)\n", "device = torch.device(\"cpu\")\n", "model.to(device)\n", "outputs = model(input_ids=inputs[\"input_ids\"], attention_mask=inputs[\"attention_mask\"], token_type_ids=inputs[\"token_type_ids\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ydcTiCd25V-8" }, "outputs": [], "source": [ "predicted_answer_coordinates, = tokenizer.convert_logits_to_predictions(inputs, logits.cpu().detach())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 451 }, "id": "HBs2yfulEcpr", "outputId": "211f0327-33e4-4561-cad4-c4025256dc5f" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Does it has logits for aggregations? True\n", "[0, 4, 5, 0, 5, 4, 3]\n" ] }, { "data": { "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CompanyRevenueShare PercentageFounding Date
0JACKSON, INC.56000001%7 february 1967
1RETRO COMP, CORP.450000002%10 june 1996
2DEEPAI, Inc.590000003%28 november 1967
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " Company Revenue Share Percentage Founding Date\n", "0 JACKSON, INC. 5600000 1% 7 february 1967\n", "1 RETRO COMP, CORP. 45000000 2% 10 june 1996\n", "2 DEEPAI, Inc. 59000000 3% 28 november 1967" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Which is the Company where Founding Date is 7 february 1967\n", "Predicted answer: JACKSON, INC.\n", "What is the max Revenue\n", "Predicted answer: MAX > 5600000, 45000000, 59000000\n", "What is the lowest Share Percentage\n", "Predicted answer: MIN > 1%, 2%, 3%\n", "What are the Founding Dates\n", "Predicted answer: 7 february 1967, 10 june 1996, 28 november 1967\n", "What is the min Share Percentage\n", "Predicted answer: MIN > 1%, 2%, 3%\n", "What is the max Revenue\n", "Predicted answer: MAX > 5600000, 45000000, 59000000\n", "How many companies are there\n", "Predicted answer: COUNT > JACKSON, INC., RETRO COMP, CORP., DEEPAI, Inc.\n" ] } ], "source": [ "# Some models (example, trained with SQA mechanism) don't have aggregations.\n", "has_aggregation = \"logits_aggregation\" in outputs\n", "print(f\"Does it has logits for aggregations? {str(has_aggregation)}\")\n", "show_results(model, outputs[\"logits\"].detach(), outputs[\"logits_aggregation\"].detach() if has_aggregation else None, queries)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wcf1_Co8T-Zc" }, "outputs": [], "source": [ "MODEL_NAME = 'tapas_jsl'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "3CPKvLZOTX4q" }, "outputs": [], "source": [ "model.save_pretrained(save_directory=MODEL_NAME,\n", " is_main_process=True, state_dict=model.state_dict())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "pmhSP6u2W0Zt" }, "outputs": [], "source": [ "! 
rm -Rf google" ] }, { "cell_type": "markdown", "metadata": { "id": "kiqoCyTvSrDu" }, "source": [ "#βœ… Saving in TensorFlow Format" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_sOe3gNLSs1L" }, "outputs": [], "source": [ "from transformers import TFTapasForQuestionAnswering\n", "\n", "# Auxiliary class for exporting a TF serving graph from the HF model\n", "class JSLTapas(TFTapasForQuestionAnswering):\n", "    @tf.function(\n", "        input_signature=[\n", "            {\n", "                \"input_ids\": tf.TensorSpec((None, None), tf.int32, name=\"input_ids\"),\n", "                \"attention_mask\": tf.TensorSpec((None, None), tf.int32, name=\"attention_mask\"),\n", "                \"token_type_ids\": tf.TensorSpec((None, None, 7), tf.int32, name=\"token_type_ids\"),\n", "            }\n", "        ]\n", "    )\n", "    def serving(self, inputs):\n", "        outputs = self.call(\n", "            input_ids=inputs[\"input_ids\"],\n", "            attention_mask=inputs[\"attention_mask\"],\n", "            token_type_ids=inputs[\"token_type_ids\"]\n", "        )\n", "\n", "        if not self._has_logits:\n", "            outputs.logits_aggregation = tf.zeros((tf.shape(outputs.logits)[0], 1))\n", "\n", "        return self.serving_output(outputs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "8Y4m1M-zTqpJ", "outputId": "1a93d205-797d-4b68-d912-163a4cbc9e77" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "All PyTorch model weights were used when initializing TFTapasForQuestionAnswering.\n", "\n", "All the weights of TFTapasForQuestionAnswering were initialized from the PyTorch model.\n", "If your task is similar to the task the model of the checkpoint was trained on, you can already use TFTapasForQuestionAnswering for predictions without further training.\n" ] } ], "source": [ "loaded_model = TFTapasForQuestionAnswering.from_pretrained(MODEL_NAME, from_pt=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "BKJH2TxVTLxb", "outputId": "1ec8628c-bf8e-4186-85ba-d820906c76c4" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "All PyTorch model weights were used when initializing JSLTapas.\n", "\n", "All the weights of JSLTapas were initialized from the PyTorch model.\n", "If your task is similar to the task the model of the checkpoint was trained on, you can already use JSLTapas for predictions without further training.\n" ] } ], "source": [ "jsl_model = JSLTapas.from_pretrained(MODEL_NAME, from_pt=True)\n", "jsl_model._has_logits = has_aggregation" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "51itsLS4UftJ" }, "outputs": [], "source": [ "TF_TMP_LOCATION = \"tmp\"" ] }, 
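{ "cell_type": "markdown", "metadata": { "id": "ttidShapeNote1" }, "source": [ "The `(None, None, 7)` spec in the serving signature is there because TAPAS encodes seven token-type channels per token (segment, column, row, previous labels, column rank, inverse column rank, numeric relation), matching the seven `token_type_embeddings` in the model printout above. A small sketch (our addition) to see that shape on one encoded query:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ttidShapeChk1" }, "outputs": [], "source": [ "# Sketch: the tokenizer returns token_type_ids of shape (batch, seq_len, 7).\n", "enc = tokenizer(table=table, queries=[\"What is the max Revenue\"], return_tensors=\"tf\")\n", "print(enc[\"token_type_ids\"].shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "wjQG6tH8UdwC", "outputId": "fce7025c-5dcf-435d-a9ba-f63aedf02eb5" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:absl:Found untraced functions such as compute_column_logits_layer_call_fn, compute_column_logits_layer_call_and_return_conditional_losses, embeddings_layer_call_fn, embeddings_layer_call_and_return_conditional_losses, encoder_layer_call_fn while saving (showing 5 of 422). 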
These functions will not be directly callable after loading.\n" ] }, { "data": { "text/plain": [ "('tmp/assets/vocab.txt',)" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Export the serving graph\n", "tf.saved_model.save(jsl_model, TF_TMP_LOCATION, signatures={\n", "    \"serving_default\": jsl_model.serving\n", "})\n", "# Save the vocabulary\n", "tokenizer.save_vocabulary(f\"{TF_TMP_LOCATION}/assets\")" ] }, { "cell_type": "markdown", "metadata": { "id": "tY6iK8v_Uul3" }, "source": [ "#βœ… Import into Spark NLP" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "aEIiD-I4Uvhg" }, "outputs": [], "source": [ "! pip install johnsnowlabs" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "QD0vqdPy29ar" }, "outputs": [], "source": [ "from johnsnowlabs import nlp, finance\n", "\n", "nlp.install(force_browser=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Xg3jId5YU-ch", "outputId": "aa9eaa9f-3296-4376-cabb-3288590d1413" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Spark Session already created, some configs may not take.\n" ] } ], "source": [ "spark = nlp.start()" ] }, 
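{ "cell_type": "markdown", "metadata": { "id": "tfReloadNote1" }, "source": [ "Before importing into Spark NLP, you can optionally verify that the exported SavedModel loads back in plain TensorFlow (a quick sketch we add for illustration):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tfReloadChk1" }, "outputs": [], "source": [ "# Sketch: reload the exported graph and list its serving signatures.\n", "reloaded_tf = tf.saved_model.load(TF_TMP_LOCATION)\n", "print(list(reloaded_tf.signatures.keys()))  # expect ['serving_default']" ] }, 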
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CompanyRevenueShare PercentageFounding Date
0JACKSON, INC.56000001%7 february 1967
1RETRO COMP, CORP.450000002%10 june 1996
2DEEPAI, Inc.590000003%28 november 1967
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " Company Revenue Share Percentage Founding Date\n", "0 JACKSON, INC. 5600000 1% 7 february 1967\n", "1 RETRO COMP, CORP. 45000000 2% 10 june 1996\n", "2 DEEPAI, Inc. 59000000 3% 28 november 1967" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Uhudt8mSVUzF", "outputId": "9da9021f-303b-43f7-f4bb-6ff27b263629" }, "outputs": [ { "data": { "text/plain": [ "['Which is the Company where Founding Date is 7 february 1967',\n", " 'What is the max Revenue',\n", " 'What is the lowest Share Percentage',\n", " 'What are the Founding Dates',\n", " 'What is the min Share Percentage',\n", " 'What is the max Revenue',\n", " 'How many companies are there']" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "queries" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "SS8-shepV_n5" }, "outputs": [], "source": [ "json_data = \"\"\"\n", "{\n", " \"header\": [\"Company\", \"Revenue\", \"Share Percentage\", \"Founding Date\"],\n", " \"rows\": [\n", " [\"JACKSON, INC.\", \"56000000\", \"1%\", \"7 february 1967\"],\n", " [\"RETRO COMP, CORP.\", \"450000000\", \"2%\", \"10 june 1996\"],\n", " [\"DEEPAI, Inc.\", \"590000000\", \"3%\", \"28 november 1967\"],\n", " ]\n", "}\n", "\"\"\"\n", "\n", "queries = [\"Which is the Company where Founding Date is 7 february 1967\",\n", " \"What is the max Revenue\",\n", " \"What is the lowest Share Percentage\",\n", " \"What are the Founding Dates\",\n", " \"What is the min Share Percentage\",\n", " \"What is the max Revenue\",\n", " \"How many companies are there\"]\n", "\n", "data = spark.createDataFrame([\n", " [json_data, \" \".join(queries)]\n", " ]).toDF(\"table_json\", \"questions\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "2YwTgFeXWY1J", "outputId": "7d862bd0-48de-4132-904b-4358494f936a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------------+--------------------+\n", "| table_json| questions|\n", "+--------------------+--------------------+\n", "|\n", "{\n", " \"header\": [\"...|Which is the Comp...|\n", "+--------------------+--------------------+\n", "\n" ] } ], "source": [ "data.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "-ofLmwYNWeiG" }, "outputs": [], "source": [ "SPARKNLP_MODEL_LOCATION = \"tapas_jsl_spark_nlp\"\n", "\n", "MODEL_NAME = \"google/tapas-base\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "E0G-8jsjVeko" }, "outputs": [], "source": [ "tokenizer = TapasTokenizer.from_pretrained(MODEL_NAME)\n", "case_sensitive = not tokenizer.do_lower_case\n", "\n", "nlp.TapasForQuestionAnswering\\\n", " .loadSavedModel(TF_TMP_LOCATION, spark)\\\n", " .setCaseSensitive(case_sensitive)\\\n", " .write().overwrite()\\\n", " .save(SPARKNLP_MODEL_LOCATION)" ] }, { "cell_type": "markdown", "metadata": { "id": "fyLpBr2zW6z5" }, "source": [ "#βœ… Checking in Spark NLP" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lXfJP-zXW9CU" }, "outputs": [], "source": [ "document_assembler = nlp.MultiDocumentAssembler() \\\n", " .setInputCols(\"table_json\", \"questions\") \\\n", " .setOutputCols(\"document_table\", \"document_questions\")\n", "\n", "text_splitter = finance.TextSplitter() \\\n", " 
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "lXfJP-zXW9CU" }, "outputs": [], "source": [ "document_assembler = nlp.MultiDocumentAssembler() \\\n", "    .setInputCols(\"table_json\", \"questions\") \\\n", "    .setOutputCols(\"document_table\", \"document_questions\")\n", "\n", "text_splitter = finance.TextSplitter() \\\n", "    .setInputCols([\"document_questions\"]) \\\n", "    .setOutputCol(\"questions\")\n", "\n", "table_assembler = nlp.TableAssembler()\\\n", "    .setInputCols([\"document_table\"])\\\n", "    .setOutputCol(\"table\")\n", "\n", "tapas = nlp.TapasForQuestionAnswering\\\n", "    .load(SPARKNLP_MODEL_LOCATION)\\\n", "    .setInputCols([\"questions\", \"table\"])\\\n", "    .setOutputCol(\"answers\")\n", "\n", "pipeline = nlp.Pipeline(stages=[\n", "    document_assembler,\n", "    text_splitter,\n", "    table_assembler,\n", "    tapas\n", "])\n", "\n", "fit_model = pipeline.fit(data)\n", "fit_model\\\n", "    .transform(data)\\\n", "    .selectExpr(\"explode(answers) AS answer\")\\\n", "    .select(\"answer\")\\\n", "    .show(truncate=False)" ] }, { "cell_type": "markdown", "metadata": { "id": "0ZRnIm1DYEXv" }, "source": [ "#βœ… Upload it to Spark NLP Models Hub! πŸš€\n", "https://modelshub.johnsnowlabs.com/" ] } ], "metadata": { "accelerator": "GPU", "colab": { "collapsed_sections": [ "0ZRnIm1DYEXv" ], "machine_shape": "hm", "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "31e3393e0d3e465d82dcff62b477c9ae": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": "" } }, "4e3e8c07211a4f12b4f86c620447f9da": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "5f5e6568c5c14e3e9fc0cb602dcca524": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "FloatProgressModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "FloatProgressModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "ProgressView", "bar_style": "success", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_e4fea04935234ff3a789052f9496029e", "max": 1, "min": 0, "orientation": "horizontal", "style": "IPY_MODEL_6e9a664c7ace4030ba7a3364212573c8", 
"value": 0 } }, "651f654c99924d87a3ccef7f57eb62c2": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": "" } }, "6e9a664c7ace4030ba7a3364212573c8": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "ProgressStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "ProgressStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "bar_color": null, "description_width": "" } }, "8f88821673844721a988714cb4cc4b0b": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HBoxModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HBoxModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HBoxView", "box_style": "", "children": [ "IPY_MODEL_e5b7be2ee2af40438e6103b126d62a13", "IPY_MODEL_5f5e6568c5c14e3e9fc0cb602dcca524", "IPY_MODEL_ab171a0a4d6b45f0bf5b2afde61d2a0d" ], "layout": "IPY_MODEL_b7fc543eee094a33aca45e77ca9e757a" } }, "ab171a0a4d6b45f0bf5b2afde61d2a0d": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HTMLModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_4e3e8c07211a4f12b4f86c620447f9da", "placeholder": "​", "style": "IPY_MODEL_651f654c99924d87a3ccef7f57eb62c2", "value": " 0/0 [00:00<?, ?it/s]" } }, "b7fc543eee094a33aca45e77ca9e757a": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "e49f036193854b1fbd2126578be5d39f": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": 
"LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "e4fea04935234ff3a789052f9496029e": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": "20px" } }, "e5b7be2ee2af40438e6103b126d62a13": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HTMLModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_e49f036193854b1fbd2126578be5d39f", "placeholder": "​", "style": "IPY_MODEL_31e3393e0d3e465d82dcff62b477c9ae", "value": "" } } } } }, "nbformat": 4, "nbformat_minor": 0 }