{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/kasparvonbeelen/ghi_python/main?labpath=6_-_Corpus_Exploration.ipynb)\n", "\n", "\n", "# 6 Corpus Exploration\n", "\n", "\n", "## Text Mining for Historians (with Python)\n", "## A Gentle Introduction to Working with Textual Data in Python\n", "\n", "### Created by Kaspar Beelen and Luke Blaxill\n", "\n", "### For the German Historical Institute, London\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This Notebook explores various tools for analysing and comparing texts at the corpus level. As such, these are your first ventures into \"macro-analysis\" with Python. The methods described here are particularly powerful in combination with the techniques for content selection explained in Notebook 5 **Corpus Creation**.\n", "\n", "More specifically, we will have a closer look at:\n", "\n", "- **Keyword in Context Analysis**: Explore context of words, similar to concordance in AntConc\n", "- **Collocations**: Compute which tokens tend to co-occur together\n", "- **Feature selection**: Compute which tokens are distinctive for a subset of texts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6.1 Keyword in Context\n", "\n", "Computers are excellent in indexing, organizing and retrieving information. However, interpreting information (especially natural language) is still a difficult task. Keyword-in-Context (KWIC) analysis, brings together the best of both worlds: the retrieval power of machines, with the close-reading skills of the historian. KWIC (or concordance) centres a corpus on a specific query term, with `n` words (or characters) to the left and the right. \n", "\n", "In this section, we investigate reports of the London Medical Officers of Health, the [London's Pulse corpus](https://wellcomelibrary.org/moh/). \n", "\n", "> The reports were produced each year by the Medical Officer of Health (MOH) of a district and set out the work done by his public health and sanitary officers. The reports provided vital data on birth and death rates, infant mortality, incidence of infectious and other diseases, and a general statement on the health of the population. \n", "\n", "Source: https://wellcomelibrary.org/moh/about-the-reports/about-the-medical-officer-of-health-reports/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We start by importing the necessary libraries. 
Some of the code is explained in previous Notebooks, so won't discuss it in detail here.\n", "\n", "The tools we need are:\n", "- `nltk`: Natural Language Toolkint: for tokenization and concordance\n", "- `pathlib`: a library for managing files and folders" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package stopwords to\n", "[nltk_data] /Users/kbeelen/nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n" ] } ], "source": [ "import nltk # import natural language toolkit\n", "nltk.download('stopwords')\n", "from pathlib import Path # import Path object from pathlib\n", "from nltk.tokenize import wordpunct_tokenize # import word_tokenize function from nltk.tokenize" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[34mantconc\u001b[m\u001b[m \u001b[34mpython\u001b[m\u001b[m python.zip\r\n" ] } ], "source": [ "!ls data/MOH/ # list all files in data/MOH/python/" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# in case you unzipped data before\n", "!rm -r data/MOH/python" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Archive: data/MOH/python.zip\n", " creating: data/MOH/python/\n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1945.b18246175.txt \n", " inflating: data/MOH/python/CityofWestminster.1932.b18247945.txt \n", " inflating: data/MOH/python/CityofWestminster.1921.b18247830.txt \n", " inflating: data/MOH/python/PoplarandBromley.1900.b18245754.txt \n", " inflating: data/MOH/python/Poplar.1919.b18120878.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1920.b18245924.txt \n", " inflating: data/MOH/python/CityofWestminster.1907.b18247726.txt \n", " inflating: data/MOH/python/CityofWestminster.1906.b18247714.txt \n", " inflating: data/MOH/python/CityofWestminster.1903.b18247684.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1902.b18245778.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1903.b1824578x.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1938.b18246102.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1960.b18246321.txt \n", " inflating: data/MOH/python/CityofWestminster.1920.b18247829.txt \n", " inflating: data/MOH/python/CityofWestminster.1945.b1824807x.txt \n", " inflating: data/MOH/python/CityofWestminster.1904.b18247696.txt \n", " inflating: data/MOH/python/Westminster.1898.b19874340.txt \n", " inflating: data/MOH/python/Westminster.1900.b19823228.txt \n", " inflating: data/MOH/python/CityofWestminster.1951.b18248135.txt \n", " inflating: data/MOH/python/CityofWestminster.1902.b18247672.txt \n", " inflating: data/MOH/python/CityofWestminster.1905.b18247702.txt \n", " inflating: data/MOH/python/Poplar.1894.b17999157.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1930.b18246023.txt \n", " inflating: data/MOH/python/CityofWestminster.1942.b18248044.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1907.b18245821.txt \n", " inflating: data/MOH/python/CityofWestminster.1928.b18247908.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1943.b18246151.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1959.b1824631x.txt \n", " inflating: 
data/MOH/python/CityofWestminster.1959.b18248214.txt \n", " inflating: data/MOH/python/CityofWestminster.1936.b18247982.txt \n", " inflating: data/MOH/python/CityofWestminster.1901.b18247660.txt \n", " inflating: data/MOH/python/Poplar.1918.b18120866.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1942.b1824614x.txt \n", " inflating: data/MOH/python/Poplar.1898.b18222882.txt \n", " inflating: data/MOH/python/CityofWestminster.1969.b18248317.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1909.b18245845.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1913.b18245882.txt \n", " inflating: data/MOH/python/CityofWestminster.1908.b18247738.txt \n", " inflating: data/MOH/python/CityofWestminster.1966.b18248287.txt \n", " inflating: data/MOH/python/CityofWestminster.1971.b18248330.txt \n", " inflating: data/MOH/python/CityofWestminster.1922.b18247842.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1964.b18246369.txt \n", " inflating: data/MOH/python/Poplar.1898.b18222833.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1957.b18246291.txt \n", " inflating: data/MOH/python/CityofWestminster.1911.b18247763.txt \n", " inflating: data/MOH/python/CityofWestminster.1910.b18247751.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1956.b1824628x.txt \n", " inflating: data/MOH/python/Poplar.1896.b19885040.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1912.b18245870.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1915.b18245900.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1937.b18246096.txt \n", " inflating: data/MOH/python/CityofWestminster.1956.b18248184.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1940.b18246126.txt \n", " inflating: data/MOH/python/CityofWestminster.1970.b18248329.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1927.b18245997.txt \n", " inflating: data/MOH/python/CityofWestminster.1925.b18247878.txt \n", " inflating: data/MOH/python/CityofWestminster.1941.b18248032.txt \n", " inflating: data/MOH/python/CityofWestminster.1952.b18248147.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1925.b18245973.txt \n", " inflating: data/MOH/python/CityofWestminster.1924.b18247866.txt \n", " inflating: data/MOH/python/CityofWestminster.1953.b18248159.txt \n", " inflating: data/MOH/python/CityofWestminster.1912.b18247775.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1953.b18246254.txt \n", " inflating: data/MOH/python/Westminster.1889.b20057076.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1914.b18245894.txt \n", " inflating: data/MOH/python/CityofWestminster.1954.b18248160.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1941.b18246138.txt \n", " inflating: data/MOH/python/Westminster.1861.b18248408.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1963.b18246357.txt \n", " inflating: data/MOH/python/CityofWestminster.1948.b1824810x.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1910.b18245857.txt \n", " inflating: data/MOH/python/CityofWestminster.1949.b18248111.txt \n", " inflating: data/MOH/python/CityofWestminster.1913.b18247787.txt \n", " inflating: data/MOH/python/Westminster.1859.b1824838x.txt \n", " inflating: data/MOH/python/Westminster.1858.b18248378.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1922.b18245948.txt \n", " inflating: data/MOH/python/CityofWestminster.1931.b18247933.txt \n", " inflating: 
data/MOH/python/Poplar.1893.b17950454.txt \n", " inflating: data/MOH/python/Westminster.1893.b18018312.txt \n", " inflating: data/MOH/python/Westminster.1899.b18223011.txt \n", " inflating: data/MOH/python/CityofWestminster.1967.b18248299.txt \n", " inflating: data/MOH/python/CityofWestminster.1964.b18248263.txt \n", " inflating: data/MOH/python/CityofWestminster.1917.b18247817.txt \n", " inflating: data/MOH/python/CityofWestminster.1915.b18247805.txt \n", " inflating: data/MOH/python/CityofWestminster.1926.b1824788x.txt \n", " inflating: data/MOH/python/Westminster.1860.b18248391.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1955.b18246278.txt \n", " inflating: data/MOH/python/CityofWestminster.1965.b18248275.txt \n", " inflating: data/MOH/python/PoplarDistrictBowandStratford.1900.b18245730.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1932.b18246047.txt \n", " inflating: data/MOH/python/CityofWestminster.1940.b18248020.txt \n", " inflating: data/MOH/python/Westminster.1894.b18018324.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1926.b18245985.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1952.b18246242.txt \n", " inflating: data/MOH/python/CityofWestminster.1914.b18247799.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1936.b18246084.txt \n", " inflating: data/MOH/python/CityofWestminster.1957.b18248196.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1908.b18245833.txt \n", " inflating: data/MOH/python/CityofWestminster.1927.b18247891.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1911.b18245869.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1928.b1824600x.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1962.b18246345.txt \n", " inflating: data/MOH/python/Westminster.1896.b18038207.txt \n", " inflating: data/MOH/python/Poplar.1897.b18222869.txt \n", " inflating: data/MOH/python/Westminster.1897.b19874352.txt \n", " inflating: data/MOH/python/Westminster.1858.b18248366.txt \n", " inflating: data/MOH/python/CityofWestminster.1955.b18248172.txt \n", " inflating: data/MOH/python/CityofWestminster.1963.b18248251.txt \n", " inflating: data/MOH/python/Poplar.1916.b18120854.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1923.b1824595x.txt \n", " inflating: data/MOH/python/Westminster.1895.b19874364.txt \n", " inflating: data/MOH/python/Westminster.1888.b20057064.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1949.b18246217.txt \n", " inflating: data/MOH/python/PoplarandBromley.1895.b18245742.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1917.b18245912.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1933.b18246059.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1924.b18245961.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1954.b18246266.txt \n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " inflating: data/MOH/python/CityofWestminster.1930.b18247921.txt \n", " inflating: data/MOH/python/CityofWestminster.1962.b1824824x.txt \n", " inflating: data/MOH/python/CityofWestminster.1923.b18247854.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1929.b18246011.txt \n", " inflating: data/MOH/python/CityofWestminster.1958.b18248202.txt \n", " inflating: data/MOH/python/CityofWestminster.1937.b18247994.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1921.b18245936.txt \n", " inflating: 
data/MOH/python/CityofWestminster.1938.b18248007.txt \n", " inflating: data/MOH/python/CityofWestminster.1947.b18248093.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1950.b18246229.txt \n", " inflating: data/MOH/python/Westminster.1891.b2005709x.txt \n", " inflating: data/MOH/python/Westminster.1857.b18248342.txt \n", " inflating: data/MOH/python/CityofWestminster.1933.b18247957.txt \n", " inflating: data/MOH/python/Poplar.1899.b18222894.txt \n", " inflating: data/MOH/python/CityofWestminster.1944.b18248068.txt \n", " inflating: data/MOH/python/CityofWestminster.1909.b1824774x.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1946.b18246187.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1931.b18246035.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1951.b18246230.txt \n", " inflating: data/MOH/python/Westminster.1857.b18248354.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1904.b18245791.txt \n", " inflating: data/MOH/python/CityofWestminster.1960.b18248226.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1961.b18246333.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1939.b18246114.txt \n", " inflating: data/MOH/python/CityofWestminster.1961.b18248238.txt \n", " inflating: data/MOH/python/CityofWestminster.1943.b18248056.txt \n", " inflating: data/MOH/python/CityofWestminster.1950.b18248123.txt \n", " inflating: data/MOH/python/CityofWestminster.1934.b18247969.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1947.b18246199.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1905.b18245808.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1906.b1824581x.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1901.b18245766.txt \n", " inflating: data/MOH/python/Westminster.1892.b20057106.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1944.b18246163.txt \n", " inflating: data/MOH/python/CityofWestminster.1968.b18248305.txt \n", " inflating: data/MOH/python/Poplar.1893.b17997835.txt \n", " inflating: data/MOH/python/PoplarMetropolitanBorough.1958.b18246308.txt \n", " inflating: data/MOH/python/CityofWestminster.1929.b1824791x.txt \n", " inflating: data/MOH/python/CityofWestminster.1939.b18248019.txt \n", " inflating: data/MOH/python/CityofWestminster.1935.b18247970.txt \n", " inflating: data/MOH/python/Poplar.1896.b19885039.txt \n" ] } ], "source": [ "!unzip data/MOH/python.zip -d data/MOH/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data are stored in the following folder structure:\n", "\n", "```\n", "data\n", "|___ moh\n", " |___ python\n", " |____ CityofWestminster.1901.b18247660.txt\n", " |____ ...\n", "```\n", "\n", "The code below:\n", "- harvests all paths to `.txt` files in `working_data/moh/python`\n", "- converts the result to a `list`" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "moh_reports_paths = list(Path('data/MOH/python').glob('*.txt')) # get all txt files in data/MOH/python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can print the paths to ten document with list slicing: `[:10]` means, get document from index positions `0` till `9`. (i.e. the first ten items)." 
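, "\n", "\n", "If slicing is new to you, here is a minimal sketch of how it behaves on a small toy list (the `letters` variable is invented purely for this illustration):\n", "\n", "```python\n", "letters = ['a', 'b', 'c', 'd', 'e']\n", "print(letters[:3])   # the first three items -> ['a', 'b', 'c']\n", "print(letters[3:])   # everything from index 3 onwards -> ['d', 'e']\n", "```"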
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[PosixPath('data/MOH/python/PoplarMetropolitanBorough.1945.b18246175.txt'), PosixPath('data/MOH/python/CityofWestminster.1932.b18247945.txt'), PosixPath('data/MOH/python/CityofWestminster.1921.b18247830.txt'), PosixPath('data/MOH/python/PoplarandBromley.1900.b18245754.txt'), PosixPath('data/MOH/python/Poplar.1919.b18120878.txt'), PosixPath('data/MOH/python/PoplarMetropolitanBorough.1920.b18245924.txt'), PosixPath('data/MOH/python/CityofWestminster.1907.b18247726.txt'), PosixPath('data/MOH/python/CityofWestminster.1906.b18247714.txt'), PosixPath('data/MOH/python/CityofWestminster.1903.b18247684.txt'), PosixPath('data/MOH/python/PoplarMetropolitanBorough.1902.b18245778.txt')]\n" ] } ], "source": [ "print(moh_reports_paths[:10]) # print the first ten items" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we know where all the files are located we can create a corpus.\n", "\n", "To do this, we apply the following steps:\n", "\n", "- create an empty list variable where we will store the tokens of the corpus (line 3)\n", "- iterate over the collected paths (line 5)\n", "- read the text file (line 6)\n", "- lowercase the text (line 6)\n", "- tokenize the string (line 7): this converts the string to a list of tokens\n", "- iterate over tokens (line 8)\n", "- test if a token contains only alphabetic characters (line 9)\n", "- add a token to the list if line 9 evaluates to True (line 10)\n", "\n", "The general flow of the program is similar to what we've seen before: we create an empty list where we store information from our text collection, in this case, all alphabetic tokens.\n", "\n", "We use one more Notebook functionality `%%time` to print how long the cell took to run.\n", "\n", "It could take a few seconds for the cell to run, so please be a bit patient:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "collected 3550169 tokens\n", "CPU times: user 2.75 s, sys: 150 ms, total: 2.9 s\n", "Wall time: 2.97 s\n" ] } ], "source": [ "%%time\n", "\n", "corpus = [] # inititialize an empty list where we will store the MOH reports\n", "\n", "for p in moh_reports_paths: # iterate over the paths to MOH reports, p will take the value of each item in moh_reports_paths \n", " text_lower = open(p).read().lower() # read the text files and lowercase the string\n", " tokens = wordpunct_tokenize(text_lower) # tokenize the string\n", " for token in tokens: # iterate over the tokens\n", " if token.isalpha(): # test if token only contains alphabetic characteris\n", " corpus.append(token) # if the above test evaluates to True, append token to the corpus list\n", "print('collected', len(corpus),'tokens')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While this program works perfectly fine, it's not the most efficient code. The example below is a bit better, especially if you're confronted with lots of text files. \n", "\n", "- the `with open` statement is a convenient way of handling the opening **and** closing of files (to make sure you don't keep all information in memory), which would slow down the execution of your program\n", "- line 8 shows a list comprehension, this is similar to a `for` loop but faster and more concise.\n", "\n", "We won't spend too much time discussing list comprehensions. The example below should suffice for now. 
We write a small program that collects odd numbers. First, we generate a list of numbers with `range(10)`..." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# see the output of range(10)\n", "list(range(10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "... we test whether a number is divisible by 2: `%` is the **modulus operator**, which returns the remainder after dividing the left-hand operand by the right-hand operand. `n % 2` evaluates to `0` if a number `n` is divisible by `2`. In Python `0` is equal to `False`, meaning that if `n % 2` evaluates to `0`/`False` we won't append the number to `odd`. If it evaluates to any other integer, we'll append `n` to `odd`." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "1\n" ] } ], "source": [ "print(10%2)\n", "print(15%2)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1, 3, 5, 7, 9]\n", "CPU times: user 422 µs, sys: 444 µs, total: 866 µs\n", "Wall time: 486 µs\n" ] } ], "source": [ "%%time\n", "# program for finding odd numbers\n", "numbers = range(10) # get numbers 0 to 9\n", "odd = [] # empty list where we store odd numbers\n", "for k in numbers: # iterate over numbers\n", " if k % 2: # test if the remainder is non-zero, i.e. the number is odd\n", " odd.append(k) # if True, append\n", "print(odd) # print the odd numbers collected" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The same can be achieved with just one line of code using a list comprehension." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3 µs, sys: 1 µs, total: 4 µs\n", "Wall time: 7.87 µs\n", "[1, 3, 5, 7, 9]\n" ] } ], "source": [ "%time\n", "odd = [k for k in range(10) if k % 2]\n", "print(odd)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### -- Exercise\n", "\n", "To see differences in performance, do the following:\n", "\n", "- Remove the `print()` statement\n", "- Increase the size of the list, i.e. change `range(10)` to `range(1000000)`.\n", "- Compare the **Wall time** of these cells" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now returning to our example: run the slightly more efficient code and observe that it produces the same output, just faster!"
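, "\n", "\n", "One new element in the version below is `.extend()`: where `.append()` adds a single item to the end of a list, `.extend()` adds each element of another list one by one. A toy illustration (the `items` list is invented for this example):\n", "\n", "```python\n", "items = [1, 2]\n", "items.append([3, 4])   # append adds the whole list as one item -> [1, 2, [3, 4]]\n", "print(items)\n", "\n", "items = [1, 2]\n", "items.extend([3, 4])   # extend adds the elements one by one -> [1, 2, 3, 4]\n", "print(items)\n", "```"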
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "collected 3550169 tokens\n", "CPU times: user 2.44 s, sys: 150 ms, total: 2.59 s\n", "Wall time: 2.62 s\n" ] } ], "source": [ "%%time\n", "\n", "corpus = [] # inititialize an empty list where we will store the MOH reports\n", "\n", "for p in moh_reports_paths: # iterate over the paths to MOH reports, p will take the value of each item in moh_reports_paths \n", " with open(p) as in_doc: # make sure to close the document after opening it\n", " tokens = wordpunct_tokenize(in_doc.read().lower())\n", " corpus.extend([t for t in tokens if t.isalpha()]) # list comprehension \n", "print('collected', len(corpus),'tokens') # print number of tokens collected" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After collecting all tokens in a `list` we can convert this to another data type: an NLTK `Text` object. The cell below shows the results of the conversion." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n" ] } ], "source": [ "print(type(corpus))\n", "nltk_corpus = nltk.text.Text(corpus) # convert the list of tokens to a nltk.text.Text object\n", "print(type(nltk_corpus))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Why is this useful? Well the NLTK`Text` object comes with many useful methods for corpus exploration. To inspect all the tools attached to a `Text` object, apply the `help()` function to `nltk_corpus` or (`help(nltk.text.Text)` does the same trick). You have to scroll down a bit (ignore all methods starting with `__`) to inspect the class methods." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on Text in module nltk.text object:\n", "\n", "class Text(builtins.object)\n", " | Text(tokens, name=None)\n", " | \n", " | A wrapper around a sequence of simple (string) tokens, which is\n", " | intended to support initial exploration of texts (via the\n", " | interactive console). Its methods perform a variety of analyses\n", " | on the text's contexts (e.g., counting, concordancing, collocation\n", " | discovery), and display the results. If you wish to write a\n", " | program which makes use of these analyses, then you should bypass\n", " | the ``Text`` class, and use the appropriate analysis function or\n", " | class directly instead.\n", " | \n", " | A ``Text`` is typically initialized from a given document or\n", " | corpus. 
E.g.:\n", " | \n", " | >>> import nltk.corpus\n", " | >>> from nltk.text import Text\n", " | >>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))\n", " | \n", " | Methods defined here:\n", " | \n", " | __getitem__(self, i)\n", " | \n", " | __init__(self, tokens, name=None)\n", " | Create a Text object.\n", " | \n", " | :param tokens: The source text.\n", " | :type tokens: sequence of str\n", " | \n", " | __len__(self)\n", " | \n", " | __repr__(self)\n", " | Return repr(self).\n", " | \n", " | __str__(self)\n", " | Return str(self).\n", " | \n", " | __unicode__ = __str__(self)\n", " | \n", " | collocations(self, num=20, window_size=2)\n", " | Print collocations derived from the text, ignoring stopwords.\n", " | \n", " | :seealso: find_collocations\n", " | :param num: The maximum number of collocations to print.\n", " | :type num: int\n", " | :param window_size: The number of tokens spanned by a collocation (default=2)\n", " | :type window_size: int\n", " | \n", " | common_contexts(self, words, num=20)\n", " | Find contexts where the specified words appear; list\n", " | most frequent common contexts first.\n", " | \n", " | :param word: The word used to seed the similarity search\n", " | :type word: str\n", " | :param num: The number of words to generate (default=20)\n", " | :type num: int\n", " | :seealso: ContextIndex.common_contexts()\n", " | \n", " | concordance(self, word, width=79, lines=25)\n", " | Prints a concordance for ``word`` with the specified context window.\n", " | Word matching is not case-sensitive.\n", " | \n", " | :param word: The target word\n", " | :type word: str\n", " | :param width: The width of each line, in characters (default=80)\n", " | :type width: int\n", " | :param lines: The number of lines to display (default=25)\n", " | :type lines: int\n", " | \n", " | :seealso: ``ConcordanceIndex``\n", " | \n", " | concordance_list(self, word, width=79, lines=25)\n", " | Generate a concordance for ``word`` with the specified context window.\n", " | Word matching is not case-sensitive.\n", " | \n", " | :param word: The target word\n", " | :type word: str\n", " | :param width: The width of each line, in characters (default=80)\n", " | :type width: int\n", " | :param lines: The number of lines to display (default=25)\n", " | :type lines: int\n", " | \n", " | :seealso: ``ConcordanceIndex``\n", " | \n", " | count(self, word)\n", " | Count the number of times this word appears in the text.\n", " | \n", " | dispersion_plot(self, words)\n", " | Produce a plot showing the distribution of the words through the text.\n", " | Requires pylab to be installed.\n", " | \n", " | :param words: The words to be plotted\n", " | :type words: list(str)\n", " | :seealso: nltk.draw.dispersion_plot()\n", " | \n", " | findall(self, regexp)\n", " | Find instances of the regular expression in the text.\n", " | The text is a list of tokens, and a regexp pattern to match\n", " | a single token must be surrounded by angle brackets. 
E.g.\n", " | \n", " | >>> print('hack'); from nltk.book import text1, text5, text9\n", " | hack...\n", " | >>> text5.findall(\"<.*><.*>\")\n", " | you rule bro; telling you bro; u twizted bro\n", " | >>> text1.findall(\"(<.*>)\")\n", " | monied; nervous; dangerous; white; white; white; pious; queer; good;\n", " | mature; white; Cape; great; wise; wise; butterless; white; fiendish;\n", " | pale; furious; better; certain; complete; dismasted; younger; brave;\n", " | brave; brave; brave\n", " | >>> text9.findall(\"{3,}\")\n", " | thread through those; the thought that; that the thing; the thing\n", " | that; that that thing; through these than through; them that the;\n", " | through the thick; them that they; thought that the\n", " | \n", " | :param regexp: A regular expression\n", " | :type regexp: str\n", " | \n", " | generate(self, words)\n", " | Issues a reminder to users following the book online\n", " | \n", " | index(self, word)\n", " | Find the index of the first occurrence of the word in the text.\n", " | \n", " | plot(self, *args)\n", " | See documentation for FreqDist.plot()\n", " | :seealso: nltk.prob.FreqDist.plot()\n", " | \n", " | readability(self, method)\n", " | \n", " | similar(self, word, num=20)\n", " | Distributional similarity: find other words which appear in the\n", " | same contexts as the specified word; list most similar words first.\n", " | \n", " | :param word: The word used to seed the similarity search\n", " | :type word: str\n", " | :param num: The number of words to generate (default=20)\n", " | :type num: int\n", " | :seealso: ContextIndex.similar_words()\n", " | \n", " | unicode_repr = __repr__(self)\n", " | \n", " | vocab(self)\n", " | :seealso: nltk.prob.FreqDist\n", " | \n", " | ----------------------------------------------------------------------\n", " | Data descriptors defined here:\n", " | \n", " | __dict__\n", " | dictionary for instance variables (if defined)\n", " | \n", " | __weakref__\n", " | list of weak references to the object (if defined)\n", "\n" ] } ], "source": [ "help(nltk_corpus) # show methods attached to the nltk.text.Text object or nltk_corpus variable" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's have a closer look at `.concordance()`. According to the official documentation this method \n", "> Prints a concordance for ``word`` with the specified context window. Word matching is not case-sensitive.\n", "\n", "It takes multiple arguments:\n", " - word: query term\n", " - width: the context window, i.e. determines the number of character printed \n", " - lines: determines the number of lines to show (i.e. KWIC examples)\n", "\n", "The first line of the output states the total number of hits for the query term (`Displaying * of * matches:`)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The example code below prints the context of the word **\"poor\"**." 
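, "\n", "\n", "If you would rather store the concordance lines (for example, to write them to a file) than just print them, the `concordance_list()` method listed in the help output above returns them instead. A minimal sketch, assuming `nltk_corpus` has been built as above (in recent NLTK versions each hit is a `ConcordanceLine` with `.line`, `.left` and `.right` attributes):\n", "\n", "```python\n", "hits = nltk_corpus.concordance_list('poor', width=100, lines=10)\n", "for hit in hits:\n", "    print(hit.line)  # the full keyword-in-context line as a string\n", "```"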
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Displaying 10 of 1112 matches:\n", "mes difficult to arrange but the friends of the poor and the charity organisation society can often \n", "l analysis in one case the milk proved to be of poor quality the work is carried out in the ordinary\n", " fat fair quality between per cent and per cent poor quality between per cent and per cent adulterat\n", "able i district total good quality fair quality poor quality adulterated no percent no percent no pe\n", "in which the applicant is already in receipt of poor law relief or is considered ought to be referen\n", "ing cases previously notified under to to total poor law institutions sanatoria poor law institution\n", "der to to total poor law institutions sanatoria poor law institutions sanatoria pulmonary males fema\n", "g to tuberculosis and the treatment of cases in poor law and other hospitals advance in social well \n", "y in which the fat was between and per cent and poor or inferior quality in which the fat was betwee\n", "ood quality no per cent fair quality no percent poor quality no percent adulterated no percent south\n" ] } ], "source": [ "nltk_corpus.concordance('poor',width=100,lines=10) # print the context of poor, window = 100 character" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### --Exercise\n", "\n", "Use KWIC analysis to compare the word \"poor\" in MOsH reportss from the City of Westminster and Poplar. Using everything you learned the previous Notebook\n", "- Create two subcopora one with Westminster, one with Poplar reports\n", "- Tokenize the texts and convert the list of tokens to an NLTK `Text` object\n", "- Use concardance to analyse the context of the work \"poor\"" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# Enter code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6.2 Collocations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While KWIC analysis is useful for investigating the context of words, it is a method that doesn't scale well: it helps with the close reading of around 100 words, but when examples run in the thousands it becomes more difficult. Collocations can help quantify the semantics of term, or how the meaning of words is different between corpora or subsamples of a corpus.\n", "\n", "Collocations, as explained in the AntConc section, are often multi-word expressions containing tokens that tend to co-occur, such \"New York City\" (the span between words can be longer, they don't have to appear next to each other)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The NLTK `Text` object has `collocations()` function. Below we print and explain the documentation.\n", "\n", "> collocations(self, num=20, window_size=2)\n", " Print collocations derived from the text, ignoring stopwords.\n", " \n", "It has the following parameters:\n", "> `:param num:` The maximum number of collocations to print.\n", "\n", "The number of collocations to print (if not specified it will print 20)\n", "\n", "> `:param window_size:` The number of tokens spanned by a collocation (default=2)\n", "\n", "If `window_size=2` collocations will only include bigrams (words occurring next to each other). 
But sometimes we wish to include longer intervals, to make the co-occurrence of words within a broader window more visible, this allows us to go beyond multiword expressions and study the distribution of words in a corpus more generally. For example, we could look if \"men\" and \"women\" are discussed in each other's context (within a span of 10), even if they don't appear next to each other. " ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "per cent; public health; county council; london county; medical\n", "officer; scarlet fever; whooping cough; males females; local\n", "government; legal proceedings; dwelling houses; poplar bromley; small\n", "pox; ice cream; sub district; government board; child welfare; city\n", "council; death rate; bromley bow\n", "CPU times: user 9.61 s, sys: 76.3 ms, total: 9.68 s\n", "Wall time: 9.72 s\n" ] } ], "source": [ "%%time\n", "nltk_corpus.collocations(window_size=2)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "street street; per cent; public health; county council; months months;\n", "london county; medical officer; scarlet fever; males females; poplar\n", "bromley; london council; road street; bromley bow; see page; road\n", "road; whooping cough; officer health; medical health; poplar bow;\n", "small pox\n", "CPU times: user 29.2 s, sys: 389 ms, total: 29.6 s\n", "Wall time: 30.5 s\n" ] } ], "source": [ "%%time \n", "nltk_corpus.collocations(window_size=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While the `.collocations()` method provides a convenient tool for obtaining collocations from a corpus, its functionality remains rather limited. Below we will inspect the collocation functions of NLTK in more detail, giving you more power as well as precision.\n", "\n", "Before we start we import all the required tools that `nltk.collocations` provides. This is handled by the `import *`, similar to a wildcard, it matches and loads all functions from `nltk.collocations`." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "import nltk\n", "from nltk.collocations import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have to select an association measure to compute the \"strength\" with which two tokens are \"attracted\" to each other. In general, collocations are words that are likely to appear together (within a specific context or window size). This explains why \"the \"red wine\" is a strong collocation and \"the wine\" less so.\n", "\n", "NLTK provides us with different measures, which you can print and investigate in more detail. Many of the functions refer to the classic NLP Handbook of Manning and Schütze, [\"Foundations of statistical natural language processing\"](https://nlp.stanford.edu/fsnlp/)." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "bigram_measures = nltk.collocations.BigramAssocMeasures()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on BigramAssocMeasures in module nltk.metrics.association object:\n", "\n", "class BigramAssocMeasures(NgramAssocMeasures)\n", " | A collection of bigram association measures. 
Each association measure\n", " | is provided as a function with three arguments::\n", " | \n", " | bigram_score_fn(n_ii, (n_ix, n_xi), n_xx)\n", " | \n", " | The arguments constitute the marginals of a contingency table, counting\n", " | the occurrences of particular events in a corpus. The letter i in the\n", " | suffix refers to the appearance of the word in question, while x indicates\n", " | the appearance of any word. Thus, for example:\n", " | \n", " | n_ii counts (w1, w2), i.e. the bigram being scored\n", " | n_ix counts (w1, *)\n", " | n_xi counts (*, w2)\n", " | n_xx counts (*, *), i.e. any bigram\n", " | \n", " | This may be shown with respect to a contingency table::\n", " | \n", " | w1 ~w1\n", " | ------ ------\n", " | w2 | n_ii | n_oi | = n_xi\n", " | ------ ------\n", " | ~w2 | n_io | n_oo |\n", " | ------ ------\n", " | = n_ix TOTAL = n_xx\n", " | \n", " | Method resolution order:\n", " | BigramAssocMeasures\n", " | NgramAssocMeasures\n", " | builtins.object\n", " | \n", " | Class methods defined here:\n", " | \n", " | chi_sq(n_ii, n_ix_xi_tuple, n_xx) from abc.ABCMeta\n", " | Scores bigrams using chi-square, i.e. phi-sq multiplied by the number\n", " | of bigrams, as in Manning and Schutze 5.3.3.\n", " | \n", " | fisher(*marginals) from abc.ABCMeta\n", " | Scores bigrams using Fisher's Exact Test (Pedersen 1996). Less\n", " | sensitive to small counts than PMI or Chi Sq, but also more expensive\n", " | to compute. Requires scipy.\n", " | \n", " | phi_sq(*marginals) from abc.ABCMeta\n", " | Scores bigrams using phi-square, the square of the Pearson correlation\n", " | coefficient.\n", " | \n", " | ----------------------------------------------------------------------\n", " | Static methods defined here:\n", " | \n", " | dice(n_ii, n_ix_xi_tuple, n_xx)\n", " | Scores bigrams using Dice's coefficient.\n", " | \n", " | ----------------------------------------------------------------------\n", " | Data and other attributes defined here:\n", " | \n", " | __abstractmethods__ = frozenset()\n", " | \n", " | ----------------------------------------------------------------------\n", " | Class methods inherited from NgramAssocMeasures:\n", " | \n", " | jaccard(*marginals) from abc.ABCMeta\n", " | Scores ngrams using the Jaccard index.\n", " | \n", " | likelihood_ratio(*marginals) from abc.ABCMeta\n", " | Scores ngrams using likelihood ratios as in Manning and Schutze 5.3.4.\n", " | \n", " | pmi(*marginals) from abc.ABCMeta\n", " | Scores ngrams by pointwise mutual information, as in Manning and\n", " | Schutze 5.4.\n", " | \n", " | poisson_stirling(*marginals) from abc.ABCMeta\n", " | Scores ngrams using the Poisson-Stirling measure.\n", " | \n", " | student_t(*marginals) from abc.ABCMeta\n", " | Scores ngrams using Student's t test with independence hypothesis\n", " | for unigrams, as in Manning and Schutze 5.3.1.\n", " | \n", " | ----------------------------------------------------------------------\n", " | Static methods inherited from NgramAssocMeasures:\n", " | \n", " | mi_like(*marginals, **kwargs)\n", " | Scores ngrams using a variant of mutual information. The keyword\n", " | argument power sets an exponent (default 3) for the numerator. 
No\n", " | logarithm of the result is calculated.\n", " | \n", " | raw_freq(*marginals)\n", " | Scores ngrams by their frequency\n", " | \n", " | ----------------------------------------------------------------------\n", " | Data descriptors inherited from NgramAssocMeasures:\n", " | \n", " | __dict__\n", " | dictionary for instance variables (if defined)\n", " | \n", " | __weakref__\n", " | list of weak references to the object (if defined)\n", "\n" ] } ], "source": [ "help(bigram_measures)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In our example we use pointwise mutual inforamtion (pmi) to compute collocations." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on method pmi in module nltk.metrics.association:\n", "\n", "pmi(*marginals) method of abc.ABCMeta instance\n", " Scores ngrams by pointwise mutual information, as in Manning and\n", " Schutze 5.4.\n", "\n" ] } ], "source": [ "help(bigram_measures.pmi)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "![pmi](https://miro.medium.com/max/930/1*OoI8_cZQwYGJEUjzozBOCw.png)\n", "\n", "`pmi` is a rather straightforward metric, in the case of bigrams (i.e. collocations of length two and window size two):\n", "- compute the total number of tokens in a corpus, assume this is `n` (3435)\n", "- compute the probability of `a` and `b` appearing as a bigram. If the bigram `(a,b)` occurs 10 times, the probability (`P(a,b)` = 10/3435 = 0.0029)\n", "- compute the probability of observing `a` and `b` across the whole corpus. For example if `a` appears `30` times and b `45`, their respective probabilities are `P(a)` = 30/3435 = 0.0087 and P(b) = 45/3435 = 0.0131. We then multiple `P(a)` and `P(b)` to obtain the denominator 0.0087 `*` 0.0131 = 0.0001\n", "- next we 0.0029 / 0.0001 = 28.9999 and log this value log2(28.9999)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4.6692787866546315" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from numpy import log2\n", "nom = 10/3435\n", "denom = (30/3435) * (45/3435)\n", "mpi = log2(nom/denom)\n", "mpi" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get collocations by their `pmi` scores, we apply the `.from_words()` method to the `nltk_corpus` (or any list of tokens). The result of this operation is stored in a `finder` object which we can subsequently used to rank and print collocations. \n", "\n", "Note that the results below look somewhat strange, these aren't very meaningful collocates." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('abso', 'lutely'),\n", " ('acidi', 'lacfc'),\n", " ('acquires', 'setiological'),\n", " ('adolph', 'mussi'),\n", " ('adolphus', 'massie'),\n", " ('adultorated', 'sanples'),\n", " ('adver', 'tising'),\n", " ('aeql', 'rrhage'),\n", " ('alathilde', 'christoffersen'),\n", " ('alio', 'wances')]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "finder = BigramCollocationFinder.from_words(nltk_corpus)\n", "finder.nbest(bigram_measures.pmi, 10) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These results are rather spurious, why? If, for example `a` and `b` both appear only once **and** next to each other, the `pmi` score will be high. 
But such collocations aren't meaningful collocation, more a rare artefact of the data.\n", "To solve this problem, we filter by ngram frequency, removing in our case all bigrams that appear less than 3 times with `.apply_freq_filter()` function." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on method apply_freq_filter in module nltk.collocations:\n", "\n", "apply_freq_filter(min_freq) method of nltk.collocations.BigramCollocationFinder instance\n", " Removes candidate ngrams which have frequency less than min_freq.\n", "\n" ] } ], "source": [ "help(finder.apply_freq_filter)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('bowers', 'gifford'),\n", " ('carrie', 'simuelson'),\n", " ('culex', 'pipiens'),\n", " ('heatherfield', 'ascot'),\n", " ('holmes', 'godson'),\n", " ('lehman', 'ashmead'),\n", " ('locum', 'tenens'),\n", " ('nemine', 'contradicente'),\n", " ('quinton', 'polyclinic'),\n", " ('rhesus', 'incompatibility')]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "finder.apply_freq_filter(3)\n", "finder.nbest(bigram_measures.pmi, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now many names appear. We can even be more strict and use a higher threshold for filtering." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('braxton', 'hicks'),\n", " ('herman', 'olsen'),\n", " ('posterior', 'basal'),\n", " ('arterio', 'sclerosis'),\n", " ('brucella', 'abortus'),\n", " ('burnishers', 'diamond'),\n", " ('pillows', 'bolsters'),\n", " ('carvers', 'gilders'),\n", " ('sweetmeats', 'cosaques'),\n", " ('bookbinding', 'lithographers')]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "finder.apply_freq_filter(20)\n", "finder.nbest(bigram_measures.pmi, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is also possible to change the window size, but the larger the window size the longer the computation takes" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('tr', 'tr'),\n", " ('felix', 'twede'),\n", " ('barmen', 'potmen'),\n", " ('harbott', 'chauffeur'),\n", " ('axel', 'welin'),\n", " ('betha', 'nicholson'),\n", " ('malcolm', 'donaldson'),\n", " ('roasters', 'grinders'),\n", " ('spasmodic', 'stridulous'),\n", " ('soapmaking', 'lubricating')]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "finder = BigramCollocationFinder.from_words(nltk_corpus, window_size = 5)\n", "finder.apply_freq_filter(10)\n", "finder.nbest(bigram_measures.pmi, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lastly, you can focus on collocations that contains a specific token, i.e. for example get all collocations with the token \"poor\". We have pass function to `.apply_ngram_filter()`. At this point, you shouldn't worry about the code, only understand how to adapt it (see exercise below). 
" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('poor', 'attenders'),\n", " ('poor', 'palatines'),\n", " ('poor', 'genl'),\n", " ('poor', 'law'),\n", " ('compositions', 'poor'),\n", " ('poor', 'sufferers'),\n", " ('poor', 'quality'),\n", " ('sleep', 'poor'),\n", " ('poor', 'visibility'),\n", " ('failures', 'poor')]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def token_filter_poor(*w):\n", " return 'poor' not in w\n", "\n", "finder = BigramCollocationFinder.from_words(nltk_corpus)\n", "finder.apply_freq_filter(3)\n", "finder.apply_ngram_filter(token_filter_poor)\n", "finder.nbest(bigram_measures.pmi, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### -- Exercise\n", "Copy-paste the above code and create a program that prints the first 10 collocations with the word \"women\".\n", "- change the frequency threshold\n", "- explore otherr association measure, to what extent do your results change?" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "# Enter code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.3 Feature selection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the last section of this Notebook, we explore computational methods for finding words that characterize a collection: we try to select tokens (more generally features) that distinguish a particular set of documents vis-a-vis another corpus. \n", "\n", "Such comparisons help us determine what type of language use was distinctive for a particular group or (such as a political party) period or location. We continue with the example of the MOsH reports, but compare the language of different boroughs, the affluent Westminster with the industrial, and considerably poorer, Poplar.\n", "\n", "The code below should look familiar, but we made a few changes.\\\n", "- to make sure all data are in the right place, we download and extract it again\n", "- we create two empty lists `corpus` and `labels`. In the former we store our text documents (each item in the list is one text file/string), the latter contains labels, `0` for Poplar and `1` for Westminster. We collect these labels in parallel with the text, i.e. the if the first item in `corpus` is a text from Westminster, the first label in `labels` is `1`.\n", "- we use `with open` to automatically close each document after opening it (line 1)\n", "- lines 9 - 12 contain an `if else` statement: if the string `westminster` appears in the file name we add `1` to `labels`, otherwise `0`." 
] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 247 ms, sys: 61.6 ms, total: 308 ms\n", "Wall time: 368 ms\n" ] } ], "source": [ "%%time\n", "import nltk # import natural language toolkit\n", "from pathlib import Path # import Path object from pathlib\n", "from nltk.tokenize import wordpunct_tokenize # import word_tokenize function from nltk.tokenize\n", "\n", "moh_reports_paths = list(Path('data/MOH/python').glob('*.txt')) # get all txt files in data/MOH/python\n", "\n", "corpus = [] # save corpus here\n", "labels = [] # save labels here\n", "\n", "for r in moh_reports_paths: # iterate over documents\n", " with open(r) as in_doc: # open document (also take care close it later)\n", " corpus.append(in_doc.read().lower()) # append the lowercased document to corpus\n", " \n", " if 'westminster' in r.name.lower(): # check if westeminster appear in the file name\n", " labels.append(1) # if so, append 1 to labels\n", " else: # if not\n", " labels.append(0) # append 0 to labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each document should correspond to one label. The lists `labels` and `corpus` should have equal length." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "159 159\n" ] } ], "source": [ "print(len(labels),len(corpus))" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\n" ] } ], "source": [ "print(len(labels) == len(corpus))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As said earlier, we collect labels for each document, `1` for Westminster and `0` for Poplar (it could also be reverse, of course!). It is important that each label corresponds correctly with a text file in `corpus`. " ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0, 1, 1, 0, 0, 0, 1, 1, 1, 0]\n" ] } ], "source": [ "print(labels[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check this by printing the first hundred characters of the first document (labelled as `0`)...\n", "\n", "Note that `corpus[0]` returns the first document, from which we slice the first hundred character `[:100]`." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"l-'&rary pop s-/ metropolitan borough of poplar . abridged interim report on the health of the borou\"" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corpus[0][:100]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "... and the second document (labelled as `0`)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'city of westminster. report of the medical officer of health for the year . - 1932 andrew j. shinnie'" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corpus[1][:100]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Checking your code by eyeballing the output is always good practice. Even if your code runs, it could still contain bugs, which are commonly referred to as \"semantic errors\"." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To obtain the most distinctive words (for both report from Westminster and Poplar) we use an external library [`TextFeatureSelection`](https://pypi.org/project/TextFeatureSelection/). Python has a very rich and fast-evolving ecosystem. If you have a problem, it's very likely someone wrote a library to help you with this problem. We first have to install this package (it's not yet part of Colab)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "import TextFeatureSelection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can apply the `TextFeatureSelection` library. The documentation is available [here](https://pypi.org/project/TextFeatureSelection/).\n", "\n", "Computing the features requires only a few lines of code. You only need to provide \n", "- a corpus for the `input_doc_list` parameter\n", "- a list of labels for the `target` parameter\n", "\n", "`TextFeatureSelection` then uses various metrics to compute the extent to which words are associated with a label. The output of this process is a `pandas.DataFrame`. Working with tabular data and data frames will be extensively discussed in Part II of this course. For now, we show you how to sort information and get the most distinctive words or features." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on module TextFeatureSelection:\n", "\n", "NAME\n", " TextFeatureSelection - Text features selection.\n", "\n", "CLASSES\n", " builtins.object\n", " TextFeatureSelection\n", " TextFeatureSelectionGA\n", " \n", " class TextFeatureSelection(builtins.object)\n", " | TextFeatureSelection(target, input_doc_list, stop_words=None, metric_list=['MI', 'CHI', 'PD', 'IG'])\n", " | \n", " | Compute score for each word to identify and select words which result in better model performance.\n", " | \n", " | Parameters\n", " | ----------\n", " | target : list object which has categories of labels. for more than one category, no need to dummy code and instead provide label encoded values as list object.\n", " | \n", " | input_doc_list : List object which has text. each element of list is text corpus. No need to tokenize, as text will be tokenized in the module while processing. target and input_doc_list should have same length.\n", " | \n", " | stop_words : Words for which you will not want to have metric values calculated. Default is blank.\n", " | \n", " | metric_list : List object which has the metric to be calculated. There are 4 metric which are being computed as 'MI','CHI','PD','IG'. you can specify one or more than one as a list object. Default is ['MI','CHI','PD','IG']. \n", " | \n", " | Returns\n", " | -------\n", " | values_df : pandas dataframe with results. unique words and score from the desried metric.\n", " | \n", " | Examples\n", " | --------\n", " | The following example shows how to retrieve the 5 most informative\n", " | features in the Friedman #1 dataset.\n", " | \n", " | >>> from sklearn.feature_selection.text import TextFeatureSelection\n", " | \n", " | >>> #Multiclass classification problem\n", " | >>> input_doc_list=['i am very happy','i just had an awesome weekend','this is a very difficult terrain to trek. 
i wish i stayed back at home.','i just had lunch','Do you want chips?']\n", " | >>> target=['Positive','Positive','Negative','Neutral','Neutral']\n", " | >>> result_df=TextFeatureSelection(target=target,input_doc_list=input_doc_list).getScore()\n", " | >>> print(result_df)\n", " | \n", " | word list word occurence count Proportional Difference Mutual Information Chi Square Information Gain\n", " | 0 am 1 1.0 0.916291 1.875000 0.089257\n", " | 1 an 1 1.0 0.916291 1.875000 0.089257\n", " | 2 at 1 1.0 1.609438 5.000000 0.000000\n", " | 3 awesome 1 1.0 0.916291 1.875000 0.089257\n", " | 4 back 1 1.0 1.609438 5.000000 0.000000\n", " | 5 chips 1 1.0 0.916291 1.875000 0.089257\n", " | 6 difficult 1 1.0 1.609438 5.000000 0.000000\n", " | 7 do 1 1.0 0.916291 1.875000 0.089257\n", " | 8 had 2 1.0 0.223144 0.833333 0.008164\n", " | 9 happy 1 1.0 0.916291 1.875000 0.089257\n", " | 10 home 1 1.0 1.609438 5.000000 0.000000\n", " | 11 is 1 1.0 1.609438 5.000000 0.000000\n", " | 12 just 2 1.0 0.223144 0.833333 0.008164\n", " | 13 lunch 1 1.0 0.916291 1.875000 0.089257\n", " | 14 stayed 1 1.0 1.609438 5.000000 0.000000\n", " | 15 terrain 1 1.0 1.609438 5.000000 0.000000\n", " | 16 this 1 1.0 1.609438 5.000000 0.000000\n", " | 17 to 1 1.0 1.609438 5.000000 0.000000\n", " | 18 trek 1 1.0 1.609438 5.000000 0.000000\n", " | 19 very 2 1.0 0.916291 2.222222 0.008164\n", " | 20 want 1 1.0 0.916291 1.875000 0.089257\n", " | 21 weekend 1 1.0 0.916291 1.875000 0.089257\n", " | 22 wish 1 1.0 1.609438 5.000000 0.000000\n", " | 23 you 1 1.0 0.916291 1.875000 0.089257\n", " | \n", " | \n", " | \n", " | >>> #Binary classification\n", " | >>> input_doc_list=['i am content with this location','i am having the time of my life','you cannot learn machine learning without linear algebra','i want to go to mars']\n", " | >>> target=[1,1,0,1]\n", " | >>> result_df=TextFeatureSelection(target=target,input_doc_list=input_doc_list).getScore()\n", " | >>> print(result_df)\n", " | word list word occurence count Proportional Difference Mutual Information Chi Square Information Gain\n", " | 0 algebra 1 -1.0 1.386294 4.000000 0.0\n", " | 1 am 2 1.0 -inf 1.333333 0.0\n", " | 2 cannot 1 -1.0 1.386294 4.000000 0.0\n", " | 3 content 1 1.0 -inf 0.444444 0.0\n", " | 4 go 1 1.0 -inf 0.444444 0.0\n", " | 5 having 1 1.0 -inf 0.444444 0.0\n", " | 6 learn 1 -1.0 1.386294 4.000000 0.0\n", " | 7 learning 1 -1.0 1.386294 4.000000 0.0\n", " | 8 life 1 1.0 -inf 0.444444 0.0\n", " | 9 linear 1 -1.0 1.386294 4.000000 0.0\n", " | 10 location 1 1.0 -inf 0.444444 0.0\n", " | 11 machine 1 -1.0 1.386294 4.000000 0.0\n", " | 12 mars 1 1.0 -inf 0.444444 0.0\n", " | 13 my 1 1.0 -inf 0.444444 0.0\n", " | 14 of 1 1.0 -inf 0.444444 0.0\n", " | 15 the 1 1.0 -inf 0.444444 0.0\n", " | 16 this 1 1.0 -inf 0.444444 0.0\n", " | 17 time 1 1.0 -inf 0.444444 0.0\n", " | 18 to 1 1.0 -inf 0.444444 0.0\n", " | 19 want 1 1.0 -inf 0.444444 0.0\n", " | 20 with 1 1.0 -inf 0.444444 0.0\n", " | 21 without 1 -1.0 1.386294 4.000000 0.0\n", " | 22 you 1 -1.0 1.386294 4.000000 0.0\n", " | \n", " | \n", " | Notes\n", " | -----\n", " | Chi-square (CHI):\n", " | - It measures the lack of independence between t and c.\n", " | - It has a natural value of zero if t and c are independent. 
If it is higher, then term is dependent\n", " | - It is not reliable for low-frequency terms\n", " | - For multi-class categories, we will calculate X^2 value for all categories and will take the Max(X^2) value across all categories at the word level.\n", " | - It is not to be confused with chi-square test and the values returned are not significance values\n", " | \n", " | Mutual information (MI):\n", " | - Rare terms will have a higher score than common terms.\n", " | - For multi-class categories, we will calculate MI value for all categories and will take the Max(MI) value across all categories at the word level.\n", " | \n", " | Proportional difference (PD):\n", " | - How close two numbers are from becoming equal. \n", " | - Helps find unigrams that occur mostly in one class of documents or the other\n", " | - We use the positive document frequency and negative document frequency of a unigram as the two numbers.\n", " | - If a unigram occurs predominantly in positive documents or predominantly in negative documents then the PD will be close to 1, however if distribution of unigram is almost similar, then PD is close to 0.\n", " | - We can set a threshold to decide which words to be included\n", " | - For multi-class categories, we will calculate PD value for all categories and will take the Max(PD) value across all categories at the word level.\n", " | \n", " | Information gain (IG):\n", " | - It gives discriminatory power of the word\n", " | \n", " | References\n", " | ----------\n", " | Yiming Yang and Jan O. Pedersen \"A Comparative Study on Feature Selection in Text Categorization\"\n", " | http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=E5CC43FE63A1627AB4C0DBD2061FE4B9?doi=10.1.1.32.9956&rep=rep1&type=pdf\n", " | \n", " | Christine Largeron, Christophe Moulin, Mathias Géry \"Entropy based feature selection for text categorization\"\n", " | https://hal.archives-ouvertes.fr/hal-00617969/document\n", " | \n", " | Mondelle Simeon, Robert J. Hilderman \"Categorical Proportional Difference: A Feature Selection Method for Text Categorization\"\n", " | https://pdfs.semanticscholar.org/6569/9f0e1159a40042cc766139f3dfac2a3860bb.pdf\n", " | \n", " | Tim O`Keefe and Irena Koprinska \"Feature Selection and Weighting Methods in Sentiment Analysis\"\n", " | https://www.researchgate.net/publication/242088860_Feature_Selection_and_Weighting_Methods_in_Sentiment_Analysis\n", " | \n", " | Methods defined here:\n", " | \n", " | __init__(self, target, input_doc_list, stop_words=None, metric_list=['MI', 'CHI', 'PD', 'IG'])\n", " | Initialize self. See help(type(self)) for accurate signature.\n", " | \n", " | getScore(self)\n", " | \n", " | ----------------------------------------------------------------------\n", " | Data descriptors defined here:\n", " | \n", " | __dict__\n", " | dictionary for instance variables (if defined)\n", " | \n", " | __weakref__\n", " | list of weak references to the object (if defined)\n", " \n", " class TextFeatureSelectionGA(builtins.object)\n", " | TextFeatureSelectionGA(generations=500, population=50, prob_crossover=0.9, prob_mutation=0.1, percentage_of_token=50, runtime_minutes=120)\n", " | \n", " | Use genetic algorithm for selecting text tokens which give best classification results\n", " | \n", " | Genetic Algorithm Parameters\n", " | ----------\n", " | \n", " | generations : Number of generations to run genetic algorithm. 500 as deafult, as used in the original paper\n", " | \n", " | population : Number of individual chromosomes. 
50 as default, as used in the original paper\n", " | \n", " | prob_crossover : Probability of crossover. 0.9 as default, as used in the original paper\n", " | \n", " | prob_mutation : Probability of mutation. 0.1 as default, as used in the original paper\n", " | \n", " | percentage_of_token : Percentage of word features to be included in a given chromosome.\n", " | 50 as default, as used in the original paper.\n", " | \n", " | runtime_minutes : Number of minutes to run the algorithm. This is checked in between generations.\n", " | At start of each generation it is checked if runtime has exceeded than alloted time.\n", " | If case run time did exceeds provided limit, best result from generations executed so far is given as output.\n", " | Default is 2 hours. i.e. 120 minutes.\n", " | \n", " | References\n", " | ----------\n", " | Noria Bidi and Zakaria Elberrichi \"Feature Selection For Text Classification Using Genetic Algorithms\"\n", " | https://ieeexplore.ieee.org/document/7804223\n", " | \n", " | Methods defined here:\n", " | \n", " | __init__(self, generations=500, population=50, prob_crossover=0.9, prob_mutation=0.1, percentage_of_token=50, runtime_minutes=120)\n", " | Initialize self. See help(type(self)) for accurate signature.\n", " | \n", " | getGeneticFeatures(self, doc_list, label_list, model=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", " | intercept_scaling=1, l1_ratio=None, max_iter=100,\n", " | multi_class='auto', n_jobs=None, penalty='l2',\n", " | random_state=None, solver='lbfgs', tol=0.0001, verbose=0,\n", " | warm_start=False), model_metric='f1', avrg='binary', analyzer='word', min_df=2, max_df=1.0, stop_words=None, tokenizer=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b', lowercase=True)\n", " | Data Parameters\n", " | ---------- \n", " | doc_list : text documents in a python list. \n", " | Example: ['i had dinner','i am on vacation','I am happy','Wastage of time']\n", " | \n", " | label_list : labels in a python list.\n", " | Example: ['Neutral','Neutral','Positive','Negative']\n", " | \n", " | \n", " | Modelling Parameters\n", " | ----------\n", " | model : Set a model which has .fit function to train model and .predict function to predict for test data. \n", " | This model should also be able to train classifier using TfidfVectorizer feature.\n", " | Default is set as Logistic regression in sklearn\n", " | \n", " | model_metric : Classifier cost function. Select one from: ['f1','precision','recall'].\n", " | Default is F1\n", " | \n", " | avrg : Averaging used in model_metric. Select one from ['micro', 'macro', 'samples','weighted', 'binary'].\n", " | For binary classification, default is 'binary' and for multi-class classification, default is 'micro'.\n", " | \n", " | \n", " | TfidfVectorizer Parameters\n", " | ----------\n", " | analyzer : {'word', 'char', 'char_wb'} or callable, default='word'\n", " | Whether the feature should be made of word or character n-grams.\n", " | Option 'char_wb' creates character n-grams only from text inside\n", " | word boundaries; n-grams at the edges of words are padded with space.\n", " | \n", " | min_df : float or int, default=2\n", " | When building the vocabulary ignore terms that have a document\n", " | frequency strictly lower than the given threshold. 
This value is also\n", " | called cut-off in the literature.\n", " | If float in range of [0.0, 1.0], the parameter represents a proportion\n", " | of documents, integer absolute counts.\n", " | This parameter is ignored if vocabulary is not None.\n", " | \n", " | max_df : float or int, default=1.0\n", " | When building the vocabulary ignore terms that have a document\n", " | frequency strictly higher than the given threshold (corpus-specific\n", " | stop words).\n", " | If float in range [0.0, 1.0], the parameter represents a proportion of\n", " | documents, integer absolute counts.\n", " | This parameter is ignored if vocabulary is not None.\n", " | \n", " | stop_words : {'english'}, list, default=None\n", " | If a string, it is passed to _check_stop_list and the appropriate stop\n", " | list is returned. 'english' is currently the only supported string\n", " | value.\n", " | There are several known issues with 'english' and you should\n", " | consider an alternative (see :ref:`stop_words`).\n", " | \n", " | If a list, that list is assumed to contain stop words, all of which\n", " | will be removed from the resulting tokens.\n", " | Only applies if ``analyzer == 'word'``.\n", " | \n", " | If None, no stop words will be used. max_df can be set to a value\n", " | in the range [0.7, 1.0) to automatically detect and filter stop\n", " | words based on intra corpus document frequency of terms.\n", " | \n", " | tokenizer : callable, default=None\n", " | Override the string tokenization step while preserving the\n", " | preprocessing and n-grams generation steps.\n", " | Only applies if ``analyzer == 'word'``\n", " | \n", " | token_pattern : str, default=r\"(?u)\\b\\w\\w+\\b\"\n", " | Regular expression denoting what constitutes a \"token\", only used\n", " | if ``analyzer == 'word'``. The default regexp selects tokens of 2\n", " | or more alphanumeric characters (punctuation is completely ignored\n", " | and always treated as a token separator).\n", " | \n", " | If there is a capturing group in token_pattern then the\n", " | captured group content, not the entire match, becomes the token.\n", " | At most one capturing group is permitted.\n", " | \n", " | lowercase : bool, default=True\n", " | Convert all characters to lowercase before tokenizing.\n", " | \n", " | ----------------------------------------------------------------------\n", " | Data descriptors defined here:\n", " | \n", " | __dict__\n", " | dictionary for instance variables (if defined)\n", " | \n", " | __weakref__\n", " | list of weak references to the object (if defined)\n", "\n", "FILE\n", " /usr/local/lib/python3.7/site-packages/TextFeatureSelection.py\n", "\n", "\n" ] } ], "source": [ "help(TextFeatureSelection)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
word listword occurence countProportional DifferenceMutual InformationChi SquareInformation Gain
000103-0.0097090.0949592.4632820.004326
10001490.0738260.0086050.1501910.000266
20000001-1.0000000.7784451.1855380.001507
3000131.000000-inf2.5954830.000000
400016311.000000-inf0.8542100.000000
.....................
42232¾gallons1-1.0000000.7784451.1855380.001507
42233¾ths1-1.0000000.7784451.1855380.001507
42234ægis11.000000-inf0.8542100.000000
42235æration1-1.0000000.7784451.1855380.001507
42236œsophagus1-1.0000000.7784451.1855380.001507
\n", "

42237 rows × 6 columns

\n", "
" ], "text/plain": [ " word list word occurence count Proportional Difference \\\n", "0 00 103 -0.009709 \n", "1 000 149 0.073826 \n", "2 000000 1 -1.000000 \n", "3 0001 3 1.000000 \n", "4 000163 1 1.000000 \n", "... ... ... ... \n", "42232 ¾gallons 1 -1.000000 \n", "42233 ¾ths 1 -1.000000 \n", "42234 ægis 1 1.000000 \n", "42235 æration 1 -1.000000 \n", "42236 œsophagus 1 -1.000000 \n", "\n", " Mutual Information Chi Square Information Gain \n", "0 0.094959 2.463282 0.004326 \n", "1 0.008605 0.150191 0.000266 \n", "2 0.778445 1.185538 0.001507 \n", "3 -inf 2.595483 0.000000 \n", "4 -inf 0.854210 0.000000 \n", "... ... ... ... \n", "42232 0.778445 1.185538 0.001507 \n", "42233 0.778445 1.185538 0.001507 \n", "42234 -inf 0.854210 0.000000 \n", "42235 0.778445 1.185538 0.001507 \n", "42236 0.778445 1.185538 0.001507 \n", "\n", "[42237 rows x 6 columns]" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from TextFeatureSelection import TextFeatureSelection # import TextFeatureSelection\n", "fsOBJ=TextFeatureSelection(target=labels,input_doc_list=corpus) # compute features\n", "df=fsOBJ.getScore() # get features as a dataframe\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A `pandas.DataFrame` is similar to an Excel speadsheet. It contain several columns which we can use for selecting and sorting information. In fact, if you are familiar with Excel, you can export the data frame and open it as a spreadsheet. The code below takes care of this." ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "df.to_excel('data/result_features.xlsx')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to know more about working with DataFrames, consul the following notebooks.\n", "- [Exploring DataFrames (Part I)](8_-_Data_Exploration_with_Pandas_I.ipynb)\n", "- [Exploring DataFrames (Part II)](9_-_Data_Exploration_with_Pandas_Part_II.ipynb)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use the following columns to select and rank words:\n", "- **Word occurence count**: How often a term occurs in the corpus\n", "- **Proportional Difference**: It helps find unigrams that occur mostly in one class of documents or the other.\"\n", "- **Mutual Information**: The discriminatory power of a word." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
word listword occurence countProportional DifferenceMutual InformationChi SquareInformation Gain
21070horseferry710.971831-3.484235102.3137620.239720
41176wes640.968750-3.38043884.8404380.205713
30188pimlico630.968254-3.36469082.5524510.201071
7989await640.906250-2.28182673.3054330.165213
9838buckingham610.901639-2.23381766.9749890.152432
20355harrison630.873016-1.97839665.7676270.144732
33224restaurant580.896552-2.18338661.0251470.140108
18111fines910.692308-1.09335779.8509900.139906
10445carpentry470.957447-3.07170351.5094040.133109
25739marshall460.956522-3.05019749.8618080.129219
\n", "
" ], "text/plain": [ " word list word occurence count Proportional Difference \\\n", "21070 horseferry 71 0.971831 \n", "41176 wes 64 0.968750 \n", "30188 pimlico 63 0.968254 \n", "7989 await 64 0.906250 \n", "9838 buckingham 61 0.901639 \n", "20355 harrison 63 0.873016 \n", "33224 restaurant 58 0.896552 \n", "18111 fines 91 0.692308 \n", "10445 carpentry 47 0.957447 \n", "25739 marshall 46 0.956522 \n", "\n", " Mutual Information Chi Square Information Gain \n", "21070 -3.484235 102.313762 0.239720 \n", "41176 -3.380438 84.840438 0.205713 \n", "30188 -3.364690 82.552451 0.201071 \n", "7989 -2.281826 73.305433 0.165213 \n", "9838 -2.233817 66.974989 0.152432 \n", "20355 -1.978396 65.767627 0.144732 \n", "33224 -2.183386 61.025147 0.140108 \n", "18111 -1.093357 79.850990 0.139906 \n", "10445 -3.071703 51.509404 0.133109 \n", "25739 -3.050197 49.861808 0.129219 " ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "westminster_df = df[(df['word occurence count'] > 20 ) & (df['Proportional Difference'] > 0 )]\n", "westminster_df.sort_values('Information Gain',ascending=False)[:10]" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
word listword occurence countProportional DifferenceMutual InformationChi SquareInformation Gain
30606pop59-1.0000000.778445110.5158900.184282
15282dock66-0.7878790.66632785.9112160.149069
22759intimations92-0.4782610.47616468.9339360.146510
22037india67-0.7611940.65129082.8334720.144441
31245procured70-0.7142860.62429479.7802180.141947
41149wellington86-0.5116280.49848566.3994550.131704
17897ferry68-0.7058820.61938074.2056240.129299
34547seamen65-0.7230770.62940971.6988920.122070
14529devons47-1.0000000.77844578.6054310.118591
26460millwall49-0.9591840.75782577.2625010.118218
\n", "
" ], "text/plain": [ " word list word occurence count Proportional Difference \\\n", "30606 pop 59 -1.000000 \n", "15282 dock 66 -0.787879 \n", "22759 intimations 92 -0.478261 \n", "22037 india 67 -0.761194 \n", "31245 procured 70 -0.714286 \n", "41149 wellington 86 -0.511628 \n", "17897 ferry 68 -0.705882 \n", "34547 seamen 65 -0.723077 \n", "14529 devons 47 -1.000000 \n", "26460 millwall 49 -0.959184 \n", "\n", " Mutual Information Chi Square Information Gain \n", "30606 0.778445 110.515890 0.184282 \n", "15282 0.666327 85.911216 0.149069 \n", "22759 0.476164 68.933936 0.146510 \n", "22037 0.651290 82.833472 0.144441 \n", "31245 0.624294 79.780218 0.141947 \n", "41149 0.498485 66.399455 0.131704 \n", "17897 0.619380 74.205624 0.129299 \n", "34547 0.629409 71.698892 0.122070 \n", "14529 0.778445 78.605431 0.118591 \n", "26460 0.757825 77.262501 0.118218 " ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "poplar_df = df[(df['word occurence count'] > 20 ) & (df['Proportional Difference'] < 0 )]\n", "poplar_df.sort_values('Information Gain',ascending=False)[:10]" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
word listword occurence countProportional DifferenceMutual InformationChi SquareInformation Gain
30606pop59-1.0000000.778445110.5158900.184282
9432bow89-0.6404490.580268106.1523390.000000
42219zymotic94-0.5531910.52560993.3269420.000000
15282dock66-0.7878790.66632785.9112160.149069
22037india67-0.7611940.65129082.8334720.144441
31245procured70-0.7142860.62429479.7802180.141947
14529devons47-1.0000000.77844578.6054310.118591
26460millwall49-0.9591840.75782577.2625010.118218
33936ruston46-1.0000000.77844576.2521520.114306
17897ferry68-0.7058820.61938074.2056240.129299
\n", "
" ], "text/plain": [ " word list word occurence count Proportional Difference \\\n", "30606 pop 59 -1.000000 \n", "9432 bow 89 -0.640449 \n", "42219 zymotic 94 -0.553191 \n", "15282 dock 66 -0.787879 \n", "22037 india 67 -0.761194 \n", "31245 procured 70 -0.714286 \n", "14529 devons 47 -1.000000 \n", "26460 millwall 49 -0.959184 \n", "33936 ruston 46 -1.000000 \n", "17897 ferry 68 -0.705882 \n", "\n", " Mutual Information Chi Square Information Gain \n", "30606 0.778445 110.515890 0.184282 \n", "9432 0.580268 106.152339 0.000000 \n", "42219 0.525609 93.326942 0.000000 \n", "15282 0.666327 85.911216 0.149069 \n", "22037 0.651290 82.833472 0.144441 \n", "31245 0.624294 79.780218 0.141947 \n", "14529 0.778445 78.605431 0.118591 \n", "26460 0.757825 77.262501 0.118218 \n", "33936 0.778445 76.252152 0.114306 \n", "17897 0.619380 74.205624 0.129299 " ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "poplar_df = df[(df['word occurence count'] > 20 ) & (df['Proportional Difference'] < 0 )]\n", "poplar_df.sort_values('Chi Square',ascending=False)[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fin." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }