{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Generating Concordances\n", "\n", "This notebook shows how you can generate a concordance using lists." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we see what text files we have." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hume Enquiry.txt negative.txt positive.txt\r\n", "Hume Treatise.txt obama_tweets.txt\r\n" ] } ], "source": [ "ls *.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are going to use \"Hume Treatise.txt\" from Project Gutenberg. You can use whatever text you want. We print the number of characters and the first 50 characters to check." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This string has 1344061 characters.\n", "The Project Gutenberg EBook of A Treatise of Human\n" ] } ], "source": [ "theText2Use = \"Hume Treatise.txt\"\n", "with open(theText2Use, \"r\") as fileToRead:\n", "    fileRead = fileToRead.read()\n", "\n", "print(\"This string has\", len(fileRead), \"characters.\")\n", "print(fileRead[:50])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tokenization\n", "\n", "Now we tokenize the text, producing a list called \"listOfTokens\", and check the first 10 words. This eliminates punctuation and lowercases the words."
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['the', 'project', 'gutenberg', 'ebook', 'of', 'a', 'treatise', 'of', 'human', 'nature']\n" ] } ], "source": [ "import re\n", "listOfTokens = re.findall(r'\\b\\w[\\w-]*\\b', fileRead.lower())\n", "print(listOfTokens[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Input\n", "\n", "Now we get the word you want a concordance for and the amount of context wanted." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "What word do you want collocates for? truth\n", "How much context do you want? 10\n" ] } ], "source": [ "word2find = input(\"What word do you want collocates for? \").lower() # Ask for the word to search for\n", "context = input(\"How much context do you want? \") # Ask how many words of context to grab on either side" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "str" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(context)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "int" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "contextInt = int(context)\n", "type(contextInt)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "228958" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(listOfTokens)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Main function\n", "\n", "Here is the main function that does the work, populating a new list with the concordance lines.
We check the last 5 concordance lines." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['220330: a reason why the faculty of recalling past ideas with truth and clearness should not have as much merit in it',\n", " '223214: confessing my errors and should esteem such a return to truth and reason to be more honourable than the most unerring',\n", " '223680: from the other this therefore being regarded as an undoubted truth that belief is nothing but a peculiar feeling different from',\n", " '224382: mind and he will evidently find this to be the truth secondly whatever may be the case with regard to this',\n", " '225925: by their different feeling i should have been nearer the truth end of project gutenberg s a treatise of human nature']" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def makeConc(word2conc,list2FindIn,context2Use,concList):\n", "\n", "    end = len(list2FindIn)\n", "    for location in range(end):\n", "        if list2FindIn[location] == word2conc:\n", "            # Here we check whether we are at the very beginning or end\n", "            if (location - context2Use) < 0:\n", "                beginCon = 0\n", "            else:\n", "                beginCon = location - context2Use\n", "\n", "            if (location + context2Use + 1) > end:\n", "                endCon = end\n", "            else:\n", "                endCon = location + context2Use + 1\n", "\n", "            theContext = (list2FindIn[beginCon:endCon])\n", "            concordanceLine = ' '.join(theContext)\n", "            # print(str(location) + \": \" + concordanceLine)\n", "            concList.append(str(location) + \": \" + concordanceLine)\n", "\n", "theConc = []\n", "makeConc(word2find,listOfTokens,int(context),theConc)\n", "theConc[-5:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Output\n", "\n", "Finally, we output to a text file."
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done\n" ] } ], "source": [ "nameOfResults = word2find.capitalize() + \".Concordance.txt\"\n", "\n", "with open(nameOfResults, \"w\") as fileToWrite:\n", "    for line in theConc:\n", "        fileToWrite.write(line + \"\\n\")\n", "\n", "print(\"Done\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we check that the file was created." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Truth.Concordance.txt\r\n" ] } ], "source": [ "ls *.Concordance.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "[CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0/) From [The Art of Literary Text Analysis](ArtOfLiteraryTextAnalysis.ipynb) by [Stéfan Sinclair](http://stefansinclair.name) & [Geoffrey Rockwell](http://geoffreyrockwell.com)
Created September 30th, 2016 (Jupyter 4.2.1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }