{ "metadata": { "name": "", "signature": "sha256:3371fa1a89748bb4e27d718ba764fcb88d6db2ccbb47e0db90c23b1bba22db3e" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Chapter 8 - Practical: Searching your own PDF library" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-- *A Python Course for the Humanities*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our Information Retrieval system of the previous chapter is simple and also quite efficient, but it lacks many features of a full-blown search engine. One particular downside of the system is that it stores the index in RAM memory. This means that either we have to keep it there or we have to rebuild the index each time we would like to search through a particular collection. We could try to improve our system, but there are some excellent Python packages for Information Retrieval and search engines, we could use. In this section we will explore one of them, called [Whoosh](http://whoosh.readthedocs.org/en/latest/), a search engine and retrieval system written in pure Python. \n", "\n", "![whoosh](files/images/whoosh_logo.png)\n", "\n", "Ever since science-journal giant Elsevier bought the once so promising bibliography management software [Mendeley](http://www.mendeley.com/), I have looked for alternative ways to manage my research PDF collection. For me, one of the most important features of a PDF management tool is to be able to do full text search in the PDFs for the content I am interested in. Uptill today I have not found a tool that fulfills all my needs, so we are going to build one ourselves. We will develop a full-blown search using Whoosh. We'll build a web interface on top of [Flask](http://flask.pocoo.org/) to query our search engine in a user-friendly way. 
Just to tease you a bit: this is what our search engine will look like:\n", "\n", "![pydf](files/images/pydf.png)\n", "\n", "This chapter is the first in a series of more practical chapters, in which we will build actual applications ready for use by end-users. You won't be learning many new programming techniques in this chapter, but we will introduce you to a large number of modules and packages that are available either in the standard library of Python or as third-party packages. The most important take-home message is that if you think about implementing a piece of software, the first thing you should do is check whether someone else has already done it. Chances are good that someone has, and he or she has probably done a better job. \n", "\n", "Before we get started, make sure you have installed both Whoosh and Flask on your computer. If you use Anaconda's Python distribution, you can execute the following command on the command line:\n", "\n", " conda install whoosh flask\n", " \n", "It is also possible to install the modules using pip:\n", "\n", " pip install whoosh flask\n", " \n", "Finally, we will need one of the programs shipped with the PDF reader [Xpdf](http://www.foolabs.com/xpdf/). Please install Xpdf from [here](http://www.foolabs.com/xpdf/download.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "The Index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `Index` is the central object in Whoosh's search engine. It allows us to store our data in a secure and sustainable way and makes it possible to search through it. Every index requires an index schema which defines the available fields in the index. Fields represent pieces of information about each document in the collection, for example the author of a text, its publication date or the text body itself. 
For our PDF index we will use a schema consisting of the following fields:\n", "\n", "1. id (a unique document ID);\n", "2. path (the filepath to the document on your computer);\n", "3. source (the filepath to the original source on your computer);\n", "4. author (the author or authors of the text);\n", "5. title (the title of a text);\n", "6. text (the actual text of the PDF).\n", "\n", "All these fields will be indexed for each PDF file in our collection and each field will be searchable. To create our schema in Whoosh, we first import the `Schema` object from `whoosh.fields`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from whoosh.fields import Schema" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For each field we need to specify to Whoosh what kind of field it is. Whoosh defines field types like `KEYWORD` for keyword fields, `ID` for unique identifier fields (e.g. filepaths), `DATETIME` for fields containing dates, `TEXT` for textual content, and some others." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Have a look at the different field types provided by Whoosh, [here](http://whoosh.readthedocs.org/en/latest/api/fields.html#pre-made-field-types). Which types would be appropriate for the fields defined above?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Double click this cell and write down your answer*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each field type can be passed some additional arguments, such as whether the field should be stored, whether it should be sortable and scorable, etc. 
For reasons that will become clear later on, we want all our ID fields plus the author and the title field to be stored in the index. We import the appropriate field types and define our schema as follows:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from whoosh.fields import ID, KEYWORD, TEXT\n", "\n", "pdf_schema = Schema(id = ID(unique=True, stored=True), \n", " path = ID(stored=True), \n", " source = ID(stored=True),\n", " author = TEXT(stored=True), \n", " title = TEXT(stored=True),\n", " text = TEXT)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have defined a schema, we can create the index using the function `create_in`. We'll create the index in the directory `pydf`. The web application will reside in the same folder. Therefore, it is convenient to first direct your notebook to that directory." ] }, { "cell_type": "code", "collapsed": false, "input": [ "cd pydf/" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "import os\n", "from whoosh.index import create_in\n", "\n", "if not os.path.exists(\"pdf-index\"):\n", " os.mkdir(\"pdf-index\")\n", " index = create_in(\"pdf-index\", pdf_schema)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After creation, the index can be opened using the function `open_dir`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from whoosh.index import open_dir\n", "\n", "index = open_dir(\"pdf-index\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have an index, we can add some documents to it. The `IndexWriter` object lets you do just that. The method `writer()` of the class `Index` returns an instantiation of the `IndexWriter`." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "writer = index.writer()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use the `add_document` method of `IndexWriter` to add documents to the index. `add_document` accepts keyword arguments corresponding to the fields we specified in our schema. We add our first document:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "writer.add_document(id = 'blei2003', \n", " path = 'data/blei2003.txt',\n", " source = 'static/pdfs/blei2003.pdf',\n", " author = 'David Blei, Andrew Ng, Michael Jordan',\n", " title = 'Latent Dirichlet Allocation',\n", " text = open('data/blei2003.txt', encoding='utf-8').read())" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And some more:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "writer.add_document(id = 'goodwyn2013', \n", " path = 'data/goodwyn2013.txt',\n", " source = 'static/pdfs/goodwyn2013.pdf',\n", " author = 'Erik Goodwyn',\n", " title = 'Recurrent motifs as resonant attractor states in the narrative \ufb01eld: a testable model of archetype',\n", " text = open('data/goodwyn2013.txt', encoding='utf-8').read())\n", "\n", "writer.add_document(id = 'meij2009', \n", " path = 'data/meij2009.txt',\n", " source = 'static/pdfs/meij2009.pdf',\n", " author = 'Edgar Meij, Dolf Trieschnigg, Maarten de Rijke, Wessel Kraaij',\n", " title = 'Conceptual language models for domain-speci\ufb01c retrieval',\n", " text = open('data/meij2009.txt', encoding='utf-8').read())\n", "\n", "writer.add_document(id = 'muellner2011', \n", " path = 'data/muellner2011.txt',\n", " source = 'static/pdfs/muellner2011.pdf',\n", " author = 'David Muellner',\n", " title = 'Modern hierarchical, agglomerative clustering algorithms',\n", " text = open('data/muellner2011.txt', encoding='utf-8').read())" ], "language": "python", "metadata": {}, "outputs": [] 
}, { "cell_type": "markdown", "metadata": {}, "source": [ "Calling commit() on the `IndexWriter` saves the changes to the index:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "writer.commit()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Searching and Querying" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The index contains four documents. How can we search the index for particular documents? Similar to the method `writer` of the `Index` object, the method `searcher` returns a `Searcher` object which allows us to search the index. The `Searcher` object opens a connection to the index, similar to the way Python opens regular files. To prevent the system from running out of file handles, it is best practice to instantiate the searcher within a `with` statement:\n", "\n", " with index.searcher() as searcher:\n", " do something\n", " \n", "We'll do so later when we automatically index a complete collection of pdfs, but for now it is slightly more convenient to instantiate a searcher as follows:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "searcher = index.searcher()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An instantiation of the class `Searcher` has a `search` method that takes as argument a `Query` object. There are two ways to construct `Query` objects: manually or via a query parser. 
To construct a query that searches for the terms *model* and *topic* in the text field, we could write something like the following:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from whoosh.query import Term, And\n", "\n", "query = And([Term(\"text\", \"model\"), Term(\"text\", \"topic\")])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can feed this query to the `search` method to obtain a `Results` object:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "results = searcher.search(query)\n", "print('Number of hits:', len(results))\n", "print('Best hit:', results[0])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that all four of our documents contain both the terms *model* and *topic*, but the paper _about_ topic models is considered the most relevant one for our query. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**a)** Construct the same query, but this time using an `Or` operator." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from whoosh.query import Or \n", "\n", "# insert your code here" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**b)** Construct a query using the `And` operator that searches for the terms *index* and *topic* in documents of which Dolf Trieschnigg is one of the co-authors:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# insert your code here" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These query constructs are very explicit and clean. 
It is, however, much more convenient to use Whoosh's `QueryParser` object to automatically parse strings into `Query` objects. We construct a query parser as follows:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from whoosh.qparser import QueryParser\n", "\n", "parser = QueryParser(\"text\", index.schema)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we have to specify the default field in which we want to search and pass the schema of our index. The `QueryParser` is quite an intelligent object. It allows users to group terms using the strings `AND` or `OR` and to exclude terms with the string `NOT`. It also allows you to manually specify other fields in which to search for specific terms:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "parser.parse(\"probability model prior\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "parser.parse(\"(cluster OR grouping) AND (model OR schema)\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "parser.parse(\"topic index author:'Dolf Trieschnigg'\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "parser.parse(\"clust*\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last query is a wildcard query that attempts to match all terms starting with *clust*. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Experiment a little with the `QueryParser` object." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "# insert your code here" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Indexing our PDFs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have some basic understanding of Whoosh' most important data structures and functions, it is time to put together a number of Python scripts that will construct a Whoosh index on the basis of your own PDF library.\n", "\n", "Open your favorite text editor and open a new Python file called `indexer.py`. Save that in the directory `python-course/pydf`. Add the schema we defined above to this file as well as the corresponding import statements. Before we will start, it is good to make a list of all the components we need to index our collection. In order to pass our documents to the `add_document` method we will need the following components:\n", "\n", "1. A function that transforms a PDF into text, since Whoosh requires plain text;\n", "2. A way to extract meta information from the PDF files (e.g. the author and the title);\n", "3. The directory or directories containing PDF files we would like to index;\n", "4. A way to remember which files are already in the index.\n", "\n", "Let's start with the first item on our bullet list. Once you have installed Xpdf, a program named `pdftotext` will be available. This program converts PDF files into .txt files. Fire up a terminal or commandline prompt and test whether the command `pdftotext` is available. The [subprocess](https://docs.python.org/3.4/library/subprocess.html) module provides different ways to execute external programs using Python. 
For example, on a Unix machine, the following lines" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import subprocess\n", "\n", "subprocess.call(['ls', '-l'])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "will silently call the program `ls` with the argument `-l` and return the program's exit status to Python; a return value of 0 means that the call completed successfully. In a similar way we can call `pdftotext` to convert a PDF file into a text file:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "subprocess.call(['pdftotext', 'pdfs/blei2003.pdf', 'data/blei2003.txt'])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If all went well, Python should return 0 to your notebook. Whoosh requires that all text is encoded in UTF-8. We can pass the argument `-enc UTF-8` to `pdftotext` to make sure our text files are in the right encoding:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "subprocess.call(['pdftotext', '-enc', 'UTF-8', 'pdfs/blei2003.pdf', 'data/blei2003.txt'])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Write a function called `pdftotext` in Python that takes as argument the filename of a PDF file. It should convert the PDF into plain text and store the result in the directory `pydf/data`. The .txt file should have the same filename as the PDF file, only with a different extension." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "import os\n", "\n", "def pdftotext(pdf):\n", " # insert your code here\n", "\n", "# if your answer is correct this should print the first 1000 bytes of the text file\n", "pdftotext(\"pdfs/blei2003.pdf\")\n", "with open(os.path.join('data', 'blei2003.txt')) as infile:\n", " print(infile.read(1000))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pdftotext` has the option `-htmlmeta` to extract some of the meta data stored in a pdf file. If we run the following command:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "subprocess.call(['pdftotext', '-htmlmeta', '-enc', 'UTF-8', \n", " 'pdfs/muellner2011.pdf', 'data/muellner2011.html'])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "the output is a HTML file that looks like this:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print(open('data/muellner2011.html').read(500))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will write a function called `parse_html` to extract the meta information and the text from these files. It takes as argument the filepath of the HTML file and returns a dictionary formatted as follows:\n", "\n", " d = {'author': AUTHOR, 'title': TITLE, 'text': TEXT}\n", " \n", "We have used BeautifulSoup in the previous chapter to read and parse web pages. It will be of service here as well." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "from bs4 import BeautifulSoup\n", "\n", "def parse_html(filename):\n", " \"\"\"Extract the Author, Title and Text from a HTML file\n", " which was produced by pdftotext with the option -htmlmeta.\"\"\"\n", " with open(filename) as infile:\n", " html = BeautifulSoup(infile, \"html.parser\", from_encoding='utf-8')\n", " d = {'text': html.pre.text}\n", " if html.title is not None:\n", " d['title'] = html.title.text\n", " for meta in html.findAll('meta'):\n", " try:\n", " if meta['name'] in ('Author', 'Title'):\n", " d[meta['name'].lower()] = meta['content']\n", " except KeyError:\n", " continue\n", " return d\n", " \n", "parse_html('data/muellner2011.html')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**a)** The `parse_html` function returns a dictionary consisting of the contents of some of the fields in our index schema. We will need this dictionary when we add new documents to our index. Reimplement the `pdftotext` function. It should convert a PDF file into a HTML file, stored in the directory `pydf/data`. Plug the function `parse_html` into the `pdftotext` function. Write the contents of the `text` field to a `.txt` file in the `pydf/data` directory. Make sure that this text file has the same filename as the PDF file except for the extension." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def pdftotext(pdf):\n", " \"\"\"Convert a pdf to a text file. 
Extract the Author and Title \n", " and return a dictionary consisting of the author, title and \n", " text.\"\"\"\n", " basename, _ = os.path.splitext(os.path.basename(pdf))\n", " # insert your code here" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**b**) For reasons that will become clear, it would be convenient to add the other fields of our index schema to the dictionary returned by `parse_html` as well. Rewrite the function `pdftotext` and add the values for the `source`, `path` and `id` fields to the dictionary returned by `parse_html`. The values for `source`, `path` and `id` should match the values we used in the examples above, where we manually added some documents to the index. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "import shutil\n", "\n", "def pdftotext(pdf):\n", " \"\"\"Convert a pdf to a text file. Extract the Author and Title \n", " and return a dictionary consisting of the author, title, text,\n", " the source path, the path of the converted text file and the \n", " file ID.\"\"\"\n", " basename, _ = os.path.splitext(os.path.basename(pdf))\n", " subprocess.call(['pdftotext', '-enc', 'UTF-8', '-htmlmeta',\n", " pdf, os.path.join('data', basename + '.html')])\n", " data = parse_html(os.path.join('data', basename + '.html'))\n", " with open(os.path.join('data', basename + '.txt'), 'w') as outfile:\n", " outfile.write(data['text'])\n", " # insert your code here\n", "\n", "pdftotext(\"pdfs/muellner2011.pdf\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**c)** After we have extracted the meta information from the HTML file, we no longer need it. The `remove` function in the module `os` can be used to remove files from your computer. Add a line of code to the `pdftotext` function that removes the HTML file." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**d)** We need to store the PDF files in the directory `pydf/static/pdfs`. The function `copy` from the `shutil` module allows you to copy or move files from one directory to the other. Add some lines of code to the `pdftotext` function in which you copy the original PDF file to the directory `pydf/static/pdfs`. Make sure that this directory exists. Otherwise, first create it using the function `mkdir` from the `os` module. You can use the function `exists` from the `os.path` module to check whether a particular file or directory exists." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the two hardest points off our list, we can move on to the next two. We need a way to make Python aware of the directories we would like to index. There are many different ways to accomplish this. I choose to make a configuration file in which we store the paths of the directories we want to index (and possibly some other information). The Python module [configparser](https://docs.python.org/3.4/library/configparser.html) provides a class `ConfigParser` with which we can parse configuration files in a format similar to Microsoft Windows INI files:\n", "\n", " [filepaths]\n", " # pdf directory represents the directory or directories\n", " # that you would like to index. Separate multiple directories\n", " # by a semicolon.\n", " pdf directory = pdfs\n", " txt directory = data\n", " index directory = pdf-index\n", " source directory = static/pdfs\n", "\n", " [programpaths]\n", " pdftotext = /usr/local/bin/pdftotext\n", "\n", " [indexer.options]\n", " recompile = no\n", " move = no\n", " search limit = 20\n", " \n", "Copy these lines to a file named `pydf.ini` in the `pydf` directory and adapt the paths to your own. 
We will read the contents of the configuration file using the class `ConfigParser`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import configparser\n", "\n", "config = configparser.ConfigParser()\n", "config.read('pydf.ini')\n", "config.sections()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It returns a structure that functions much like a dictionary:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "config['filepaths']['pdf directory']" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a configuration file, let's adjust the function `pdftotext` to make it a little more general. In the current version we hard-coded the path to the output directory as well as the path to the `pdftotext` binary. We move those elements to the function declaration to make them arguments of the function. While we're at it, let's also remove some other code redundancies:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from os.path import basename, splitext\n", "\n", "\n", "def fileid(filepath):\n", " \"\"\"\n", " Return the basename of a file without its extension.\n", " >>> fileid('/some/path/to/a/file.pdf')\n", " 'file'\n", " \"\"\"\n", " base, _ = splitext(basename(filepath))\n", " return base\n", "\n", "\n", "def pdftotext(pdf, outdir='.', sourcedir='source', p2t='pdftotext', move=False):\n", " \"\"\"Convert a pdf to a text file. 
Extract the Author and Title \n", " and return a dictionary consisting of the author, title, text,\n", " the source path, the path of the converted text file and the \n", " file ID.\"\"\" \n", " filename = fileid(pdf)\n", " htmlpath = os.path.join(outdir, filename + '.html')\n", " txtpath = os.path.join(outdir, filename + '.txt')\n", " if not os.path.exists(sourcedir):\n", " os.mkdir(sourcedir)\n", " sourcepath = os.path.join(sourcedir, filename + '.pdf')\n", " subprocess.call([p2t, '-enc', 'UTF-8', '-htmlmeta', pdf, htmlpath])\n", " data = parse_html(htmlpath)\n", " os.remove(htmlpath)\n", " file_action = shutil.move if move else shutil.copy\n", " file_action(pdf, sourcepath)\n", " with open(txtpath, 'w') as outfile:\n", " outfile.write(data['text'])\n", " data['source'] = sourcepath\n", " data['path'] = txtpath\n", " data['id'] = fileid(pdf)\n", " return data\n", "\n", "pdftotext(\"pdfs/blei2003.pdf\", \n", " outdir=config.get('filepaths', 'txt directory'),\n", " sourcedir=config.get('filepaths', 'source directory'),\n", " move=config.getboolean('indexer.options', 'move'))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With that set, we are ready to write the main routine of our indexing procedure. I give you its skeleton below. At first sight it might seem quite daunting, but it is actually just a procedure that puts together the statements we have used before. Fill in the gaps." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "import glob\n", "\n", "def index_collection(configpath):\n", " \"Main routine to index a collection of PDFs using Whoosh.\"\n", " config = configparser.ConfigParser()\n", " # read the configuration file\n", " # insert your code here\n", " \n", " recompile = config.getboolean(\"indexer.options\", \"recompile\")\n", " # check whether the supplied index directory already exists\n", " if not os.path.exists(config.get(\"filepaths\", \"index directory\")):\n", " # if not, create a new directory and initialize the index\n", " os.mkdir(config.get(\"filepaths\", \"index directory\"))\n", " index = create_in(config.get(\"filepaths\", \"index directory\"), schema=pdf_schema)\n", " recompile = True\n", " # open a connection to the index\n", " index = # insert your code here\n", " \n", " # retrieve a set of all file IDs we already indexed\n", " indexed = set(map(fileid, os.listdir(config.get(\"filepaths\", \"txt directory\"))))\n", " # initialize a IndexWriter object\n", " writer = # insert your code here\n", " \n", " # iterate over all directories \n", " for directory in config.get(\"filepaths\", \"pdf directory\").split(';'):\n", " # iterate over all PDF files in this directory\n", " for filepath in glob.glob(directory + \"/*.pdf\"):\n", " # poor man's solution to check whether we already indexed this pdf\n", " if fileid(filepath) not in indexed or recompile:\n", " try:\n", " # call the function pdftotext with the correct arguments\n", " data = # insert your code here\n", " \n", " # add the new document to the index\n", " writer.add_document(**data)\n", " except (IOError, UnicodeDecodeError) as error:\n", " print(error)\n", " # commit our changes\n", " # insert your code here" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great! 
Now, add some of your own PDF files to the PDF folder `pydf/pdfs` and execute the following cell:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "index_collection('pydf.ini')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Before we continue, move the functions `index_collection`, `pdftotext`, `parse_html` and `fileid` to the file `indexer.py` together with their corresponding imports.** Add the call in the cell above at the end of the file, inside a `__main__` guard:\n", "\n", " if __name__ == '__main__':\n", " index_collection('pydf.ini')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Building a Web Interface with Flask" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![flask](http://flask.pocoo.org/static/logo/flask.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, the dry and hard part of our PDF archiving and search app is over. Now it is time to focus on creating a web application with which we can query the index in a user-friendly way. I chose the microframework [Flask](http://flask.pocoo.org/), an elegant web framework that enables you to get a web app up and running in no time. In no time? Really, in no time! Open a new file called `hello.py` in your favorite text editor and add the following lines of code:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from flask import Flask\n", "app = Flask(__name__)\n", "\n", "@app.route(\"/\")\n", "def hello():\n", " return \"Hello World!\"\n", "\n", "if __name__ == \"__main__\":\n", " app.run(port=5000)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, open a terminal and run the script using:\n", "\n", " python hello.py\n", " \n", "Direct your browser to http://127.0.0.1:5000/ to see the result. 
That is what I call a simple web framework, yet a very powerful one too.\n", "\n", "In the file `pydf/templates/index.html`, I created a simple web page that will serve as the landing page of our web application. We can render such pages using Flask's `render_template` function. Open a file called `pydf.py` in the directory `pydf` and add the following lines of code:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from flask import Flask, render_template\n", "\n", "app = Flask(__name__)\n", "\n", "@app.route('/')\n", "def index():\n", "    return render_template('index.html')\n", "\n", "if __name__ == '__main__':\n", "    app.run(debug=True, host='localhost', port=8000, use_reloader=True, threaded=True)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the application with\n", " \n", "    python pydf.py\n", " \n", "and check out the result at http://127.0.0.1:8000/. The search box is not working yet. We need two things: (1) a function to search our collection on the basis of a query and (2) a function that hands these results to Flask so that it can render them properly. We'll start with the search function." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**a)** Write a function called `search` that takes as argument a query represented by a string. Open the PDF index, parse the query, search for the results and return a list of dictionaries in which each dictionary represents a separate search result, with the field names as keys and their corresponding values as values." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "from whoosh.index import open_dir\n", "from whoosh.qparser import QueryParser\n", "\n", "def search(query):\n", "    # insert your code here\n", "    \n", "print(list(search(\"(topic model) OR (index probability)\")))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**b)** Whoosh's `Result` object contains a method to create highlighted search result excerpts. Field values that are stored in our index can be highlighted directly by Whoosh, using:\n", "\n", "    result.highlights(FIELDNAME)\n", " \n", "Since we did not store the actual text of our PDFs in the index, we must first open and read the text file corresponding to our search result. That is the reason why we stored the path to our text files in the field `path`. Once we have the contents of the text file, we can call the highlight method as follows:\n", "\n", "    result.highlights(\"text\", text=contents)\n", "\n", "Adapt the function `search` in such a way that it includes the highlighted search result excerpts for each search result." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def search(query):\n", "    # insert your code here\n", "    \n", "print(list(search(\"(topic model) OR (index probability)\")))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a search function, we need a way to represent the results in a format that a browser can read. I chose the simple solution of converting the results directly into HTML. 
The following function takes as argument a single search result yielded by `search` and returns an HTML representation of the result:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def to_html(result):\n", "    \"Return a representation of a search result in HTML.\"\n", "    title = result['title'] if 'title' in result else result['id']\n", "    author = result['author'] if 'author' in result else ''\n", "    html = \"\"\"\n", "
<div class=\"result\">\n", "  <a href=\"%s\"><h4>%s</h4></a>\n", "  <p class=\"author\">%s</p>\n", "  <p class=\"snippet\">%s</p>\n", "</div>\n", "
 \n", "    \"\"\" % (result['source'], title, author, result['snippet'])\n", "    return html\n", "\n", "print(to_html(next(search(\"topic model\"))))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With that in place, all that is left is to write a function that is connected to the search box in the web interface. This function will be called after a user presses enter in the search box and returns the results of the query." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from flask import request, jsonify\n", "\n", "@app.route('/searchbox', methods=['POST'])\n", "def searchbox():\n", "    query = request.form['q'].strip()\n", "    html_results = '\\n'.join(map(to_html, search(query)))\n", "    return jsonify({'html': html_results})" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `searchbox` function is called by a piece of JavaScript that resides in `static/script.js`. The function extracts the query, converts the results to HTML and returns them as JSON to the same JavaScript, which is responsible for putting them in the right place on our web page.\n", "\n", "An exciting moment: our PDF search application is ready. Take it for a spin:\n", "\n", "    python pydf.py\n", " \n", "and direct your browser to http://127.0.0.1:8000/. Have fun!" 
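If you want to check the round trip through `searchbox` without opening a browser, Flask's built-in test client can post to the route directly. In the sketch below, `search` and `to_html` are hypothetical stubs standing in for the real functions, so the example is self-contained and does not need the index:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def search(query):
    # hypothetical stub; the real search() queries the Whoosh index
    yield {'title': 'A topic model primer', 'snippet': query}

def to_html(result):
    # hypothetical stub; the real to_html() renders a richer snippet
    return "<h4>%s</h4><p>%s</p>" % (result['title'], result['snippet'])

@app.route('/searchbox', methods=['POST'])
def searchbox():
    query = request.form['q'].strip()
    html_results = '\n'.join(map(to_html, search(query)))
    return jsonify({'html': html_results})

# Flask's test client posts form data to the route without a server
with app.test_client() as client:
    response = client.post('/searchbox', data={'q': ' topic model '})
    payload = response.get_json()

print(payload)
```

The same pattern works against the real `pydf.py` app, which is handy for catching errors in `search` before wiring up the JavaScript.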
Ignore the code below; it's just here to make the page pretty:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from IPython.core.display import HTML\n", "def css_styling():\n", "    styles = open(\"styles/custom.css\", \"r\").read()\n", "    return HTML(styles)\n", "css_styling()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "/*\n", "Placeholder for custom user CSS\n", "\n", "mainly to be overridden in profile/static/custom/custom.css\n", "\n", "This will always be an empty file in IPython\n", "*/\n", "\n", "" ], "metadata": {}, "output_type": "pyout", "prompt_number": 5, "text": [ "" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\"Creative
Python Programming for the Humanities by http://fbkarsdorp.github.io/python-course is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at https://github.com/fbkarsdorp/python-course.

" ] } ], "metadata": {} } ] }