{"cells":[{"cell_type":"markdown","id":"db5f4f9a-7776-42b3-8758-85624d4c15ea","metadata":{"id":"db5f4f9a-7776-42b3-8758-85624d4c15ea"},"source":[""]},{"cell_type":"markdown","id":"21e9eafb","metadata":{"id":"21e9eafb"},"source":["[](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/08.0.Answering_Questions_Financial_Texts.ipynb)"]},{"cell_type":"markdown","id":"9859b3bc-cec4-4189-88ed-37add5484623","metadata":{"id":"9859b3bc-cec4-4189-88ed-37add5484623"},"source":["#🔎 Answering Questions on Financial Texts\n","📜One of the biggest recent advances in NLP is **Language Models** and their ability to answer questions expressed in natural language."]},{"cell_type":"markdown","id":"__BAKoJW8zVv","metadata":{"id":"__BAKoJW8zVv"},"source":["> *While our gross profit margin increased to 81.4% in 2020 from 63.1% in 2019, our revenues declined approximately 27% in 2020...\n","...\n","We reported an operating loss of approximately \\$8,048,581 million in 2020 as compared to an operating loss of \\$7,738,193 in 2019\n","...*\n","\n","\n"]},{"cell_type":"markdown","id":"uyqNQOoX8wgw","metadata":{"id":"uyqNQOoX8wgw"},"source":["```\n","- What is the profit increase?\n","- What was the decline in revenue?\n","- What was the operating loss in 2020?\n","- What was the operating loss in 2019?\n","```"]},{"cell_type":"markdown","id":"elX0hFQlEbuY","metadata":{"id":"elX0hFQlEbuY"},"source":[""]},{"cell_type":"markdown","id":"uvJsv20T7EPW","metadata":{"id":"uvJsv20T7EPW"},"source":["📜\n","\n","**Question Answering (QA)** uses specific Language Models trained to carry out **Natural Language Inference (NLI)**\n","\n","**NLI** works as follows:\n","- Given a text as a Premise (P);\n","- Given a hypothesis (H) as a question to be solved;\n"," - Then, we ask the Language Model whether H is `entailed`, `contradicted` or `not related` in P. 
\n"]},{"cell_type":"markdown","id":"uenfXatl-dAR","metadata":{"id":"uenfXatl-dAR"},"source":[""]},{"cell_type":"markdown","id":"fyNazopzASEM","metadata":{"id":"fyNazopzASEM"},"source":["Although we are not getting into the maths of it, it's basically done by using a Language Model to encode P and H and then carrying out sentence similarity operations."]},{"cell_type":"markdown","id":"zjRUEO9SAQc2","metadata":{"id":"zjRUEO9SAQc2"},"source":[""]},{"cell_type":"markdown","id":"HMYsdvT2Ao3D","metadata":{"id":"HMYsdvT2Ao3D"},"source":["##📌 Applications of NLI: The basics\n","The most straightforward application is retrieving answers to natural language questions.\n"," - Type 1: Open-book questions, where you give the text (P) to the model.\n"," - Type 2: Closed-book questions, where you rely only on the knowledge the pretrained Language Model acquired from texts at training time."]},{"cell_type":"markdown","id":"xmtzclv4BHEX","metadata":{"id":"xmtzclv4BHEX"},"source":["##📌 Applications of NLI: Zero-shot\n","At John Snow Labs, we have developed our own annotators based on NLI, not only to carry out Question Answering, but also to use QA to:\n","- Retrieve Entities, also known as Zero-shot NER;\n","- Retrieve Relations, also known as Zero-shot Relation Extraction;"]},{"cell_type":"markdown","id":"CJs7rSAWBeLh","metadata":{"id":"CJs7rSAWBeLh"},"source":["###✔️ How do we achieve Zero-shot NER with QA?\n","Given a Question Q, for example, `What was the profit increase in 2017?`, and given the text P `In 2017, the Company reported a profit increase of $4 million dollars compared to 2016` we:\n","\n","- Generate Hypotheses H with the tokens of the text\n"," - The profit increase in 2017 was 2017: `contradiction`\n"," - The profit increase in 2017 was Company: `contradiction`\n"," - The profit increase in 2017 was ...: `contradiction`\n"," - The profit increase in 2017 was $4: `entailment`\n"," - The profit increase in 2017 was million: `entailment`\n","\n","- We check all the H against P to see if they are 
`entailed`. If so, we return them as NER entities. If several tokens in a row return `entailed`, we check if they can be part of the same chunk."]},{"cell_type":"markdown","id":"R_DqE80WF70u","metadata":{"id":"R_DqE80WF70u"},"source":[""]},{"cell_type":"markdown","id":"rJ0D69AzCiPg","metadata":{"id":"rJ0D69AzCiPg"},"source":["Let's take a look at some examples of applications of QA to Financial Texts."]},{"cell_type":"markdown","id":"4iIO6G_B3pqq","metadata":{"collapsed":false,"id":"4iIO6G_B3pqq"},"source":["#🎬 Installation"]},{"cell_type":"code","execution_count":null,"id":"hPwo4Czy3pqq","metadata":{"id":"hPwo4Czy3pqq","pycharm":{"is_executing":true}},"outputs":[],"source":["! pip install -q johnsnowlabs"]},{"cell_type":"markdown","id":"YPsbAnNoPt0Z","metadata":{"id":"YPsbAnNoPt0Z"},"source":["##🔗 Automatic Installation\n","Using my.johnsnowlabs.com SSO"]},{"cell_type":"code","execution_count":2,"id":"_L-7mLYp3pqr","metadata":{"id":"_L-7mLYp3pqr","pycharm":{"is_executing":true},"executionInfo":{"status":"ok","timestamp":1686068159407,"user_tz":-180,"elapsed":1390,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[],"source":["from johnsnowlabs import nlp, finance\n","\n","# nlp.install(force_browser=True)"]},{"cell_type":"markdown","id":"hsJvn_WWM2GL","metadata":{"id":"hsJvn_WWM2GL"},"source":["##🔗 Manual downloading\n","If you are not registered in my.johnsnowlabs.com, you received a license via email, or you are using Safari, you may need to update the license manually.\n","\n","- Go to my.johnsnowlabs.com\n","- Download your license\n","- Upload it using the following command"]},{"cell_type":"code","execution_count":null,"id":"i57QV3-_P2sQ","metadata":{"id":"i57QV3-_P2sQ"},"outputs":[],"source":["from google.colab import files\n","print('Please Upload your John Snow Labs License using the button below')\n","license_keys = 
files.upload()"]},{"cell_type":"markdown","id":"xGgNdFzZP_hQ","metadata":{"id":"xGgNdFzZP_hQ"},"source":["- Install it"]},{"cell_type":"code","execution_count":null,"id":"OfmmPqknP4rR","metadata":{"id":"OfmmPqknP4rR"},"outputs":[],"source":["nlp.install()"]},{"cell_type":"markdown","id":"DCl5ErZkNNLk","metadata":{"id":"DCl5ErZkNNLk"},"source":["#📌 Starting"]},{"cell_type":"code","execution_count":5,"id":"x3jVICoa3pqr","metadata":{"id":"x3jVICoa3pqr","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1686068279445,"user_tz":-180,"elapsed":8737,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}},"outputId":"e977590b-877e-49ae-b0f1-6da179bb7086"},"outputs":[{"output_type":"stream","name":"stdout","text":["👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7162 (6).json\n","👌 Launched \u001b[92mcpu optimized\u001b[39m session with with: 🚀Spark-NLP==4.4.1, 💊Spark-Healthcare==4.4.2, running on ⚡ PySpark==3.1.2\n"]}],"source":["spark = nlp.start()"]},{"cell_type":"markdown","id":"dB2JSW4NDIqj","metadata":{"id":"dB2JSW4NDIqj"},"source":["#🔎 Open Book Questions"]},{"cell_type":"code","execution_count":null,"id":"Ynw9MOlEDec1","metadata":{"id":"Ynw9MOlEDec1"},"outputs":[],"source":["! 
wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/cdns-20220101.html.txt"]},{"cell_type":"code","execution_count":null,"id":"aw0j68WtDhvg","metadata":{"id":"aw0j68WtDhvg"},"outputs":[],"source":["with open('cdns-20220101.html.txt', 'r') as f:\n"," cadence_sec10k = f.read()"]},{"cell_type":"markdown","id":"qYe1sY19I8w8","metadata":{"id":"qYe1sY19I8w8"},"source":["Let's take a random piece of text from our 10-K filing..."]},{"cell_type":"code","execution_count":null,"id":"aNTvwdWyD-ko","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"aNTvwdWyD-ko","outputId":"e7e0dc48-360f-4b94-acb4-9b7e9c6b4815","executionInfo":{"status":"ok","timestamp":1685791294398,"user_tz":-180,"elapsed":3,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["necessary, on commercially reasonable terms or at all and, even if successful, those alternative actions may not allow us to meet our scheduled debt service obligations. The agreement governing our revolving credit facility restricts our ability to dispose of assets and use the proceeds from those dispositions and may also restrict our ability to raise debt or equity capital to be used to repay other indebtedness when it becomes due. We may not be able to consummate those dispositions or to obtain proceeds in an amount sufficient to meet any debt service obligations then due. \n","In addition, we conduct a substantial portion of our operations through our subsidiaries, none of which are currently guarantors of our indebtedness. Accordingly, repayment of our indebtedness is dependent on the generation of cash flow by our subsidiaries and their ability to make such cash available to us, by dividend, debt repayment or otherwise. Our subsidiaries do not have any obligation to pay amounts due on our indebtedness or to make funds available for that purpose. 
Our subsidiaries may not be able to, or may not be permitted to, make distributions to enable us to make payments in respect of our indebtedness. Each subsidiary is a distinct legal entity, and, under certain circumstances, legal and contractual restrictions may limit our ability to obtain cash from our subsidiaries. In the event that we do not receive distributions from our subsidiaries, we may be unable to make required principal and interest payments on our indebtedness.\n","24\n","Table of Contents\n","If we cannot make scheduled payments on our debt, we will be in default and holders of our debt could declare all outstanding principal and interest to be due and payable, the lenders under our revolving credit facility could terminate their commitments to loan money and we could be forced into bankruptcy or liquidation. In addition, a material default on our indebtedness could suspend our eligibility to register securities using certain registration statement forms under SEC guidelines that permit incorporation by reference of substantial information regarding us, potentially hindering our ability to raise capital through the issuance of our securities and increasing our costs of registration.\n","Despite our current level of indebtedness, we and our subsidiaries may incur substantially more debt. This could further exacerbate the risks to our financial condition described above.\n","We and our subsidiaries may incur significant additional indebtedness in the future. Although the agreement governing our revolving credit facility contains restrictions on the incurrence of additional indebtedness, these restrictions are subject to a number of qualifications and exceptions, and the additional indebtedness incurred in compliance with these restrictions could be substantial. 
If we incur any additional indebtedness that ranks equally with the 2024 Notes, then subject to any collateral arrangements we may enter into, the holders of that debt will be entitled to share ratably in any proceeds distributed in connection with any insolvency, liquidation, reorganization, dissolution or other winding up of our company. \n","Our variable rate indebtedness subjects us to interest rate risk, which could cause our debt service obligations to increase significantly.\n","Borrowings under our revolving credit facility are at variable rates of interest and expose us to interest rate risk. If interest rates were to increase, our debt service obligations on our variable rate indebtedness would increase even though the amount borrowed remained the same, and our net income and cash flows, including cash available for servicing our indebtedness, would correspondingly decrease. In the future, we may enter into interest rate swaps that involve the exchange of floating for fixed rate interest payments in order to reduce interest rate volatility. However, we may not maintain interest rate swaps with respect to all of our variable rate indebtedness, and any swaps we enter into may not fully mitigate our interest rate risk.\n","Our revolving credit facility utilizes, at our option, either (1) LIBOR, plus a margin of between 0.750% and 1.250%, determined by reference to the credit rating of our unsecured debt, or (2) base rate plus a margin of 0.000% to 0.250%, determined by reference to the credit rating of our unsecured debt, to calculate the amount of accrued interest on any borrowings. Regulators in certain jurisdictions including the United Kingdom and the United States have begun to phase out the use of LIBOR, ceasing publication for certain tenors of the U.S. dollar (and other) LIBOR at the end of 2021, with plans to cease publication for the remaining tenors of U.S. dollar LIBOR beginning June 30, 2023. 
Our revolving credit facility contains provisions that contemplate the transition from LIBOR under specified events; however, the transition from LIBOR to a new replacement benchmark remains uncertain at this time and the consequences of such developments cannot be entirely predicted, but could result in an increase in the cost of our borrowings under our existing credit facility and any future borrowings.\n","In addition, our revolving credit facility uses a pricing grid based on our credit ratings. If our credit ratings are downgraded or other negative action is taken, the interest rate payable by us under our revolving credit facility would increase. Credit rating downgrades could also restrict our ability to obtain additional financing in the future and affect the terms of any such financing.\n","Various factors could increase our future borrowing costs or reduce our access to capital, including a lowering or withdrawal of the ratings assigned to us and our 2024 Notes by credit rating agencies.\n","We may in the future seek additional financing for a variety of reasons, and our future borrowing costs, terms and access to capital could be affected by factors including the condition of the debt and equity markets, the condition of the economy generally, prevailing interest rates, our level of indebtedness, our credit rating and our business and financial condition. In addition, the 2024 Notes currently have an investment grade credit rating, which could be lowered or withdrawn entirely by a credit rating agency based on adverse changes to circumstances relating to the basis of the credit rating. Consequently, real or anticipated changes in our credit ratings will generally affect the market value of the 2024 Notes. Any future lowering of the credit ratings of the 2024 Notes likely would make it more difficult or more expensive for us to obtain additional debt financing. \n","\n","Item 1B. Unresolved Staff Comments\n","None.\n","\n","Item 2. 
Properties\n","We own land and buildings at our headquarters located in San Jose, California. We also own buildings in India. As of January 1, 2022, the total square footage of our owned buildings was approximately 1,010,000.\n","We lease additional facilities in the United States and various other countries. We may sublease certain of these facilities where space is not fully utilized.\n","We believe that these facilities are adequate for our current needs and that suitable additional or substitute space will be available as needed to accommodate any expansion of our operations.\n","\n","25\n","Table of Contents\n","Item 3. Legal Proceedings\n","From time to time, we are involved in various disputes and legal proceedings that arise in the ordinary course of business. These include disputes and legal proceedings related to intellectual property, indemnification obligations, mergers and acquisitions, licensing, contracts, customers, products, distribution and other commercial arrangements and employee relations matters. At least quarterly, we review the status of each significant matter and assess its potential financial exposure. If the potential loss from any claim or legal proceeding is considered probable and the amount or the range of loss can be estimated, we accrue a liability for the estimated loss. Legal proceedings are subject to uncertainties, and the outcomes are difficult to predict. Because of such uncertainties, accruals are based on our judgments using the best information available at the time. As additional information becomes available, we reassess the potential liability related to pending claims and legal proceedings and may revise estimates.\n"," \n","Item 4. Mine Safety Disclosures\n","Not applicable.\n","\n","\n","26\n","Table of Contents\n","PART II. \n","\n","Item 5. 
Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities\n","Our common stock is traded on the Nasdaq Global Select Market under the symbol CDNS. As of February 5, 2022, we had 384 registered stockholders and approximately 340,000 beneficial owners of our common stock.\n","\n","Stockholder Return Performance Graph\n","The following graph compares the cumulative 5-year total stockholder return on our common stock relative to the cumulative total return of the Nasdaq Composite Index,\n"]}],"source":["random_piece = cadence_sec10k[135000:144000]\n","print(random_piece)"]},{"cell_type":"markdown","id":"uuw6cjWPJWHe","metadata":{"id":"uuw6cjWPJWHe"},"source":["Items 2, 3, and 5 look like good candidates to ask questions about!"]},{"cell_type":"code","execution_count":null,"id":"cRS6OPrtJPQs","metadata":{"id":"cRS6OPrtJPQs"},"outputs":[],"source":["item2 = \"\"\"We own land and buildings at our headquarters located in San Jose, California. We also own buildings in India. As of January 1, 2022, the total square footage of our owned buildings was approximately 1,010,000.\n","We lease additional facilities in the United States and various other countries. We may sublease certain of these facilities where space is not fully utilized.\"\"\"\n","\n","item3 = \"\"\"From time to time, we are involved in various disputes and legal proceedings that arise in the ordinary course of business. These include disputes and legal proceedings related to intellectual property, indemnification obligations, mergers and acquisitions, licensing, contracts, customers, products, distribution and other commercial arrangements and employee relations matters. At least quarterly, we review the status of each significant matter and assess its potential financial exposure. If the potential loss from any claim or legal proceeding is considered probable and the amount or the range of loss can be estimated, we accrue a liability for the estimated loss. 
Legal proceedings are subject to uncertainties, and the outcomes are difficult to predict. Because of such uncertainties, accruals are based on our judgments using the best information available at the time. As additional information becomes available, we reassess the potential liability related to pending claims and legal proceedings and may revise estimates.\"\"\"\n","\n","item5 = \"\"\"Our common stock is traded on the Nasdaq Global Select Market under the symbol CDNS. As of February 5, 2022, we had 384 registered stockholders and approximately 340,000 beneficial owners of our common stock.\"\"\""]},{"cell_type":"markdown","id":"gIsVya5JERsH","metadata":{"id":"gIsVya5JERsH"},"source":["##🚀 Let's create a pipeline\n","We will use a `BERT`-based QA model named `finqa_bert`.\n"]},{"cell_type":"markdown","id":"3H4Mk01YL0Qt","metadata":{"id":"3H4Mk01YL0Qt"},"source":["📜To do that, we use in our pipelines:\n","- a `MultiDocumentAssembler`, which puts together questions (Q to create H) and context (P).\n","- a `BertForQuestionAnswering` pretrained model. 
\n","\n","🚀**IMPORTANT: We highly recommend using `setCaseSensitive(False)` so that uppercase words are not treated as proper nouns, which may trigger out-of-vocabulary (OOV) issues.**"]},{"cell_type":"code","execution_count":null,"id":"PrEa2a9lEorE","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"PrEa2a9lEorE","outputId":"c356e156-1978-4431-9b24-aa43653dfb5c","executionInfo":{"status":"ok","timestamp":1685791450671,"user_tz":-180,"elapsed":61895,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["finqa_bert download started this may take some time.\n","Approximate size to download 389 MB\n","[OK!]\n"]}],"source":["documentAssembler = nlp.MultiDocumentAssembler()\\\n"," .setInputCols([\"question\", \"context\"])\\\n"," .setOutputCols([\"document_question\", \"document_context\"])\n","\n","spanClassifier = nlp.BertForQuestionAnswering.pretrained(\"finqa_bert\",\"en\", \"finance/models\") \\\n"," .setInputCols([\"document_question\", \"document_context\"]) \\\n"," .setOutputCol(\"answer\") \\\n"," .setCaseSensitive(False)\n","\n","qa_pipeline = nlp.Pipeline().setStages([\n"," documentAssembler,\n"," spanClassifier\n","])"]},{"cell_type":"code","execution_count":null,"id":"6rIUTqmoEwgu","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"6rIUTqmoEwgu","outputId":"c0d74078-c141-43b9-e515-128e51057b20","executionInfo":{"status":"ok","timestamp":1685791456816,"user_tz":-180,"elapsed":6151,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["+--------------------+--------------------+\n","| question| context|\n","+--------------------+--------------------+\n","|Where are the hea...|We own land and b...|\n","|What is the total...|We own land and b...|\n","|In which countrie...|We own land and b...|\n","+--------------------+--------------------+\n","\n"]}],"source":["P = item2\n","\n","Q = [\n"," \"Where are the 
headquarters?\",\n"," \"What is the total square footage?\",\n"," \"In which countries do they lease facilities?\"\n","]\n","\n","Q_P = [ [q, P] for q in Q]\n","\n","example = spark.createDataFrame(Q_P).toDF(\"question\", \"context\")\n","\n","example.show()"]},{"cell_type":"code","execution_count":null,"id":"KQoGvQOlIGTV","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"KQoGvQOlIGTV","outputId":"1a9c7c9d-e29b-4689-fdfe-d3b66ab393dd","executionInfo":{"status":"ok","timestamp":1685791474121,"user_tz":-180,"elapsed":10279,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["+--------------------------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n","|question |result |answer |\n","+--------------------------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n","|Where are the headquarters? |[San Jose , California]|[{chunk, 0, 20, San Jose , California, {chunk -> 0, start_score -> 0.8019189, score -> 0.83842176, end -> 20, start -> 17, end_score -> 0.8749246, sentence -> 0}, []}]|\n","|What is the total square footage? 
|[1 , 010 , 000] |[{chunk, 0, 12, 1 , 010 , 000, {chunk -> 0, start_score -> 0.66597635, score -> 0.7811918, end -> 55, start -> 50, end_score -> 0.89640725, sentence -> 0}, []}] |\n","|In which countries do they lease facilities?|[United States] |[{chunk, 0, 12, United States, {chunk -> 0, start_score -> 0.5888994, score -> 0.52136713, end -> 64, start -> 63, end_score -> 0.4538349, sentence -> 0}, []}] |\n","+--------------------------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n","\n"]}],"source":["result = qa_pipeline.fit(example).transform(example)\n","\n","result.select('question', 'answer.result', 'answer').show(truncate=False)"]},{"cell_type":"code","execution_count":null,"id":"SY-DNoG8Lrgh","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"SY-DNoG8Lrgh","outputId":"4f1c0f3b-c9ef-49ca-fb38-b17c4bb9c023","executionInfo":{"status":"ok","timestamp":1685791475651,"user_tz":-180,"elapsed":1533,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["+-----------------------------------+-----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n","|question |result |answer |\n","+-----------------------------------+-----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n","|Where is their common stock traded?|[Nasdaq Global Select Market]|[{chunk, 0, 26, Nasdaq Global Select Market, {chunk -> 0, start_score -> 0.30269945, score -> 0.41721502, end -> 21, start -> 16, end_score -> 0.5317306, sentence -> 0}, 
[]}]|\n","|Which is the trading symbol? |[CDNS] |[{chunk, 0, 3, CDNS, {chunk -> 0, start_score -> 0.8779542, score -> 0.8447887, end -> 25, start -> 24, end_score -> 0.8116232, sentence -> 0}, []}] |\n","+-----------------------------------+-----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n","\n"]}],"source":["P = item5\n","\n","Q = [\n"," \"Where is their common stock traded?\",\n"," \"Which is the trading symbol?\"\n","]\n","\n","Q_P = [ [q, P] for q in Q]\n","\n","example = spark.createDataFrame(Q_P).toDF(\"question\", \"context\")\n","\n","result = qa_pipeline.fit(example).transform(example)\n","\n","result.select('question', 'answer.result', 'answer').show(truncate=False)"]},{"cell_type":"code","execution_count":null,"id":"GqWhSXuAMOJy","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"GqWhSXuAMOJy","outputId":"a2a666ea-8c93-447c-9979-b7edb0f180cc","executionInfo":{"status":"ok","timestamp":1685791486889,"user_tz":-180,"elapsed":1992,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["+------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n","|question |result |answer 
|\n","+------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n","|What kind of disputes or legal proceedings related to?|[intellectual property , indemnification obligations , mergers and acquisitions , licensing , contracts , customers , products , distribution and other commercial arrangements and employee relations matters]|[{chunk, 0, 204, intellectual property , indemnification obligations , mergers and acquisitions , licensing , contracts , customers , products , distribution and other commercial arrangements and employee relations matters, {chunk -> 0, start_score -> 0.63349277, score -> 0.56178546, end -> 71, start -> 43, end_score -> 0.4900781, sentence -> 0}, []}]|\n","+------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n","\n"]}],"source":["P = item3\n","\n","Q = [\n"," \"What kind of disputes or legal proceedings related 
to?\"\n","]\n","\n","Q_P = [ [q, P] for q in Q]\n","\n","example = spark.createDataFrame(Q_P).toDF(\"question\", \"context\")\n","\n","result = qa_pipeline.fit(example).transform(example)\n","\n","result.select('question', 'answer.result', 'answer').show(truncate=False)"]},{"cell_type":"markdown","id":"lgTmRajbMSGc","metadata":{"id":"lgTmRajbMSGc"},"source":["#🔎 Automatic Question Generation\n","Now the question is ... is there a way to generate the questions automatically?\n","\n","The answer is simple: **YES**, there is!\n","\n","We have several ways to generate a series of questions, given, for example:\n","- A `SUBJECT` of a sentence;\n","- An `ACTION` (verb);\n","\n","More specifically, there are two ways:\n","1. Using the grammatical information (Part of Speech and Dependency Tree);\n","2. Using NER, a Contextual Parser, or another method to retrieve the SUBJECT and VERB.\n","\n","Check the notebook \"Automatic Question Generation\" for examples of how to do it."]},{"cell_type":"markdown","id":"S9pfvS3mzVuD","metadata":{"id":"S9pfvS3mzVuD"},"source":["#🔎 Table Question Answering\n","For table question answering, we have a specific notebook that you will find in this workshop. Feel free to check it out too!\n","\n","But in the meantime, a small spoiler..."]},{"cell_type":"markdown","id":"C6uINimy6OUv","metadata":{"id":"C6uINimy6OUv"},"source":["#🔎 1. 
From csv files"]},{"cell_type":"markdown","id":"5fbldq8G6Qpp","metadata":{"id":"5fbldq8G6Qpp"},"source":["Let's create a `csv` file with information about clients and agreements."]},{"cell_type":"code","execution_count":null,"id":"SH_kpoAm7KDw","metadata":{"id":"SH_kpoAm7KDw"},"outputs":[],"source":["import pandas as pd\n","\n","df_data = { \n"," \"header\" : ['client name', 'last operation year', 'last operation amount', 'document'],\n"," \"rows\" : [ \n"," ['John Smith', '2007', '$200000', 'NDA'],\n"," ['Jack Gordon', '2017', '$10000', 'Credit Agreement'],\n"," ['Mary Lean', '2001', '$120000', 'License Agreement'],\n"," ['Jessica James', '2022', '$1200000', 'Purchase Agreement'],\n","]\n","}\n","\n","\n","df = pd.DataFrame(df_data['rows'], columns=df_data['header'])\n","\n","df.to_csv('table.csv', index=False)\n"]},{"cell_type":"code","execution_count":null,"id":"mgWwgvvoBcur","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"mgWwgvvoBcur","outputId":"e9f39ba5-05bc-4e08-d390-6ea1c712075b","executionInfo":{"status":"ok","timestamp":1685791503145,"user_tz":-180,"elapsed":287,"user":{"displayName":"Bünyamin Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"execute_result","data":{"text/plain":["{'header': ['client name',\n"," 'last operation year',\n"," 'last operation amount',\n"," 'document'],\n"," 'rows': [['John Smith', '2007', '$200000', 'NDA'],\n"," ['Jack Gordon', '2017', '$10000', 'Credit Agreement'],\n"," ['Mary Lean', '2001', '$120000', 'License Agreement'],\n"," ['Jessica James', '2022', '$1200000', 'Purchase Agreement']]}"]},"metadata":{},"execution_count":16}],"source":["df_data"]},{"cell_type":"code","execution_count":null,"id":"adF0OloH7gSG","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":175},"id":"adF0OloH7gSG","outputId":"1ef3df34-d9fe-454a-de38-d03e28b93676","executionInfo":{"status":"ok","timestamp":1685791504424,"user_tz":-180,"elapsed":7,"user":{"displayName":"Bünyamin 
Polat","userId":"03982086590103784785"}}},"outputs":[{"output_type":"execute_result","data":{"text/plain":[" client name last operation year last operation amount document\n","0 John Smith 2007 $200000 NDA\n","1 Jack Gordon 2017 $10000 Credit Agreement\n","2 Mary Lean 2001 $120000 License Agreement\n","3 Jessica James 2022 $1200000 Purchase Agreement"],"text/html":["\n","
| \n"," | client name | \n","last operation year | \n","last operation amount | \n","document | \n","
|---|---|---|---|---|
| 0 | \n","John Smith | \n","2007 | \n","$200000 | \n","NDA | \n","
| 1 | \n","Jack Gordon | \n","2017 | \n","$10000 | \n","Credit Agreement | \n","
| 2 | \n","Mary Lean | \n","2001 | \n","$120000 | \n","License Agreement | \n","
| 3 | \n","Jessica James | \n","2022 | \n","$1200000 | \n","Purchase Agreement | \n","