{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "wx0Sg_fRM5Kh" }, "source": [ "# Notebook [2]: Using the PDF converter" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Y5PrFhdQMeBF" }, "source": [ "\n", "\n", "This notebook shows how to use the PDF converter to create an input DataFrame for the cdQA pipeline from a directory of PDF files.\n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "f58-FXmbMfjz" }, "source": [ "***Note:*** *To run this notebook you will need access to a GPU. If you are using Colab, install `cdQA` by running `!pip install cdqa` in a cell.*" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2019-07-20T13:41:40.814076Z", "start_time": "2019-07-20T13:41:39.440654Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 76 }, "colab_type": "code", "collapsed": true, "id": "7UMrjUJ2EGmu", "outputId": "97fb0bd8-8a73-4cd0-cd43-eb326067a03d" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/andre.farias/python3.7.0/lib/python3.7/site-packages/tqdm/autonotebook/__init__.py:18: TqdmExperimentalWarning: Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)\n", " \" (e.g. 
in jupyter console)\", TqdmExperimentalWarning)\n" ] } ], "source": [ "import os\n", "import pandas as pd\n", "from ast import literal_eval\n", "\n", "from cdqa.utils.converters import pdf_converter\n", "from cdqa.utils.filters import filter_paragraphs\n", "from cdqa.pipeline import QAPipeline\n", "from cdqa.utils.download import download_model" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "V1fV_dquOrx0" }, "source": [ "### Download the pre-trained reader model and the PDF files" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2019-07-20T13:42:54.139892Z", "start_time": "2019-07-20T13:41:41.869993Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Downloading trained model...\n" ] } ], "source": [ "# Download the pre-trained reader model into ./models\n", "download_model(model='bert-squad_1.1', dir='./models')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2019-07-20T13:43:21.153039Z", "start_time": "2019-07-20T13:43:20.228398Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 94 }, "colab_type": "code", "id": "yhg8jFjbERzv", "outputId": "3c5414b9-979b-4342-c76d-ab3a05520d3e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Downloading PDF files...\n" ] } ], "source": [ "# Download PDF files from BNP Paribas public news\n", "def download_pdf():\n", "    import os\n", "    import wget\n", "\n", "    directory = './data/pdf/'\n", "    pdf_urls = [\n", "        'https://invest.bnpparibas.com/documents/1q19-pr-12648',\n", "        'https://invest.bnpparibas.com/documents/4q18-pr-18000',\n", "        'https://invest.bnpparibas.com/documents/4q17-pr'\n", "    ]\n", "\n", "    print('\\nDownloading PDF files...')\n", "\n", "    # Create the target directory if it does not already exist\n", "    os.makedirs(directory, exist_ok=True)\n", "    for url in pdf_urls:\n", "        wget.download(url=url, out=directory)\n", "\n", "download_pdf()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": 
"QqPK6BV2O-RO" }, "source": [ "### Convert the PDF files into a DataFrame for the cdQA pipeline" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2019-07-20T13:44:01.821890Z", "start_time": "2019-07-20T13:43:22.685954Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "colab_type": "code", "id": "czafu4-aEXXm", "outputId": "d1c13305-b4a3-4dff-f0ec-6bf277ca3b2a" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2019-07-20 15:43:22,713 [MainThread ] [INFO ] Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar to /var/folders/fy/3wb1p_ms5r3g97jm4y93pqd40000gn/T/tika-server.jar.\n", "2019-07-20 15:43:34,191 [MainThread ] [INFO ] Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar.md5 to /var/folders/fy/3wb1p_ms5r3g97jm4y93pqd40000gn/T/tika-server.jar.md5.\n", "2019-07-20 15:43:34,617 [MainThread ] [WARNI] Failed to see startup log message; retrying...\n" ] }, { "data": { "text/html": [ "
|  | title | paragraphs |\n",
"|---|---|---|\n",
"| 0 | 4q17-pr.pdf | [GOOD START OF THE 2020 PLAN * COST OF RISK... |\n",
"| 1 | 4q18-pr2.pdf | [SIGNIFICANT PROGRESS IN THE DIGITAL TRANSFORM... |\n",
"| 2 | 1q19-pr-12648.pdf | [The business of BNP Paribas was up this quart... |\n", "