{ "metadata": { "name": "", "signature": "sha256:41a53a916b6fa9c23ef4f59d79a40c4961d07c0543c440ca1d3c1a343bd4e93a" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Chapter 7 - Searching Large Collections of Text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-- *A Python Course for the Humanities*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Imagine you have collected a large number of documents for your research. A large collection of English novels from [Project Gutenberg](http://www.gutenberg.org/), for example, or all volumes of the *New York Times* from 1987 until 2007 (see [here](https://catalog.ldc.upenn.edu/LDC2008T19)). The *New York Times* corpus contains over 1.8 million articles. If on average each article consists of only 500 words (the actual number will probably be higher), the complete corpus contains about 900 million words. How can we search through such large collections of text? How can we efficiently find the information we are looking for?\n", "\n", "These questions are central to the current chapter. We will discuss important concepts from *Information Retrieval* for indexing and searching collections of text, such as the *inverted index* and the ranking metric *Okapi BM25*. This wouldn't be a Python course if you didn't have to implement the techniques and methods discussed. We will implement an important search algorithm that allows you to efficiently and reliably extract information from corpora. Along the way we will discuss quite a few new and important programming techniques and constructs. We will also deepen our knowledge of Object-Oriented Programming (OOP). We have a lot of ground to cover, so let's get started."
] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Indexing Collections of Text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). (Manning, NLP course, Coursera)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The field of Information Retrieval (IR) deals with developing techniques that allow users to efficiently search through large collections (of text) to find documents that are relevant to a particular user's question. These days we almost immediately think of *web search* in the context of IR, but there are many other use cases, such as e-mail search, searching for music (e.g. on Spotify) or searching for content on your laptop." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![caption](files/images/IR-schema.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most search models in Information Retrieval follow the schema above. We have a collection of items, text documents for example, that are indexed using an indexer with a particular indexing schema. Given a user query, the index is used to retrieve $k$ documents which, after scoring and sorting, are presented to the user. This might all seem a little abstract, but it will become clear once we start implementing our own IR system.\n", "\n", "Consider a newspaper collection covering the complete 20th century. In this collection we would like to find all documents that mention [Albert Einstein](https://en.wikipedia.org/wiki/Albert_einstein) and [Edwin Hubble](https://en.wikipedia.org/wiki/Edwin_Hubble) but not [Enrico Fermi](https://en.wikipedia.org/wiki/Enrico_Fermi).
Those of you familiar with the Unix search tool [grep](https://en.wikipedia.org/wiki/Grep) might argue that we could simply use grep to search for all documents that contain both Einstein and Hubble, and then filter out those that mention Fermi." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Come up with at least two reasons why searching with grep is not a good option." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Double click this cell and write down your answer.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Term-Document Incidence Matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To answer such queries efficiently, we need a different representation of our collection than plain text. Have a look at the following table.\n", "\n", "
\n", "| | 12-11-1928 | 04-04-1946 | 03-11-1983 | 19-01-1999 |\n", "|---|---|---|---|---|\n", "| Einstein | 1 | 1 | 0 | 0 |\n", "| Hubble | 1 | 1 | 1 | 0 |\n", "| Fermi | 1 | 0 | 0 | 0 |\n", "| Winfrey | 0 | 0 | 0 | 1 |\n", "| Dylan | 0 | 0 | 1 | 1 |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
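To make the table concrete, here is a minimal sketch of how such a term-document incidence matrix can answer a Boolean query like *Einstein AND Hubble AND NOT Fermi*. The names `incidence`, `dates` and `answer_query` are illustrative choices for this sketch, not part of the chapter's final implementation:

```python
# Term-document incidence matrix from the table above: for each term,
# a row of 0/1 flags, one per newspaper issue (in chronological order).
incidence = {
    "einstein": [1, 1, 0, 0],
    "hubble":   [1, 1, 1, 0],
    "fermi":    [1, 0, 0, 0],
    "winfrey":  [0, 0, 0, 1],
    "dylan":    [0, 0, 1, 1],
}

# The issue dates, one per column of the matrix.
dates = ["12-11-1928", "04-04-1946", "03-11-1983", "19-01-1999"]

def answer_query(term_a, term_b, excluded_term):
    """Return the dates of all issues containing term_a AND term_b
    but NOT excluded_term."""
    hits = []
    for i, date in enumerate(dates):
        if (incidence[term_a][i] and incidence[term_b][i]
                and not incidence[excluded_term][i]):
            hits.append(date)
    return hits

print(answer_query("einstein", "hubble", "fermi"))  # ['04-04-1946']
```

Note that each row is really just a bit vector: in a full-scale system the same query is computed with bitwise AND and NOT over such vectors, which is exactly what makes this representation so much faster than scanning the raw text.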
Python Programming for the Humanities by http://fbkarsdorp.github.io/python-course is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at https://github.com/fbkarsdorp/python-course.