{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Generating Concordances\n",
"\n",
"This notebook shows how you can generate a concordance using lists."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First we see what text files we have. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Hume Enquiry.txt negative.txt positive.txt\r\n",
"Hume Treatise.txt obama_tweets.txt\r\n"
]
}
],
"source": [
"ls *.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are going to use \"Hume Treatise.txt\", a Project Gutenberg copy of Hume's _A Treatise of Human Nature_. You can use whatever text you want. We print the number of characters and the first 50 characters to check."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This string has 1344061 characters.\n",
"The Project Gutenberg EBook of A Treatise of Human\n"
]
}
],
"source": [
"theText2Use = \"Hume Treatise.txt\"\n",
"with open(theText2Use, \"r\") as fileToRead:\n",
" fileRead = fileToRead.read()\n",
" \n",
"print(\"This string has\", len(fileRead), \"characters.\")\n",
"print(fileRead[:50])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tokenization\n",
"\n",
"Now we tokenize the text, producing a list called \"listOfTokens\", and check the first ten words. The regular expression keeps runs of word characters (with optional internal hyphens), which eliminates punctuation, and we lowercase the text before matching."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['the', 'project', 'gutenberg', 'ebook', 'of', 'a', 'treatise', 'of', 'human', 'nature']\n"
]
}
],
"source": [
"import re\n",
"listOfTokens = re.findall(r'\\b\\w[\\w-]*\\b', fileRead.lower())\n",
"print(listOfTokens[:10])"
]
},
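{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside (a hedged sketch, not part of the original workflow — the `sample` string here is just an invented illustration), we can see what the regular expression does by applying it to a short string. Punctuation and the dashes are dropped, while the hyphenated word survives as one token:\n",
"\n",
"```python\n",
"import re\n",
"sample = \"Of the ORIGIN of our IDEAS -- all perceptions, self-evident.\"\n",
"print(re.findall(r'\\b\\w[\\w-]*\\b', sample.lower()))\n",
"# ['of', 'the', 'origin', 'of', 'our', 'ideas', 'all', 'perceptions', 'self-evident']\n",
"```"
]
},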
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Input\n",
"\n",
"Now we ask for the word you want a concordance for and how many words of context to show on either side. Note that `input()` returns a string, so the context will need to be converted to an integer before we can use it in arithmetic."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"What word do you want collocates for? truth\n",
"How much context do you want? 10\n"
]
}
],
"source": [
"word2find = input(\"What word do you want collocates for? \").lower() # The word to search for, lowercased to match the tokens\n",
"context = input(\"How much context do you want? \") # The number of words of context on either side; input() returns a string"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"str"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(context)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"int"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"contextInt = int(context)\n",
"type(contextInt)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"228958"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(listOfTokens)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Main function\n",
"\n",
"Here is the main function that does the work, populating a new list with the lines of the concordance. We check the last 5 concordance lines."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['220330: a reason why the faculty of recalling past ideas with truth and clearness should not have as much merit in it',\n",
" '223214: confessing my errors and should esteem such a return to truth and reason to be more honourable than the most unerring',\n",
" '223680: from the other this therefore being regarded as an undoubted truth that belief is nothing but a peculiar feeling different from',\n",
" '224382: mind and he will evidently find this to be the truth secondly whatever may be the case with regard to this',\n",
" '225925: by their different feeling i should have been nearer the truth end of project gutenberg s a treatise of human nature']"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def makeConc(word2conc,list2FindIn,context2Use,concList):\n",
"\n",
" end = len(list2FindIn)\n",
" for location in range(end):\n",
" if list2FindIn[location] == word2conc:\n",
" # Here we check whether we are at the very beginning or end\n",
" if (location - context2Use) < 0:\n",
" beginCon = 0\n",
" else:\n",
" beginCon = location - context2Use\n",
" \n",
" if (location + context2Use) > end:\n",
" endCon = end\n",
" else:\n",
" endCon = location + context2Use + 1\n",
" \n",
" theContext = (list2FindIn[beginCon:endCon])\n",
" concordanceLine = ' '.join(theContext)\n",
" # print(str(location) + \": \" + concordanceLine)\n",
" concList.append(str(location) + \": \" + concordanceLine)\n",
"\n",
"theConc = []\n",
"makeConc(word2find,listOfTokens,int(context),theConc)\n",
"theConc[-5:]"
]
},
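{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hedged aside (not part of the original notebook, and `theConc2` is a new name used only for this sketch), the same concordance can be written more compactly with a list comprehension, using `max()` to clamp the start of the slice — a slice end past the end of the list is clamped automatically by Python:\n",
"\n",
"```python\n",
"theConc2 = [str(i) + \": \" + ' '.join(listOfTokens[max(0, i - contextInt):i + contextInt + 1])\n",
"            for i, token in enumerate(listOfTokens) if token == word2find]\n",
"```"
]
},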
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Output\n",
"\n",
"Finally, we output to a text file."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done\n"
]
}
],
"source": [
"nameOfResults = word2find.capitalize() + \".Concordance.txt\"\n",
"\n",
"with open(nameOfResults, \"w\") as fileToWrite:\n",
" for line in theConc:\n",
" fileToWrite.write(line + \"\\n\")\n",
" \n",
"print(\"Done\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we check that the file was created."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Truth.Concordance.txt\r\n"
]
}
],
"source": [
"ls *.Concordance.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next Steps"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Onwards to our final utility example: [Exploring a text with NLTK](Exploring%20a%20text%20with%20NLTK.ipynb)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"[CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0/) From [The Art of Literary Text Analysis](../ArtOfLiteraryTextAnalysis.ipynb) by [Stéfan Sinclair](http://stefansinclair.name) & [Geoffrey Rockwell](http://geoffreyrockwell.com). Edited and revised by [Melissa Mony](http://melissamony.com).
Created September 30th, 2016 (Jupyter 4.2.1)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 1
}