{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Created by [Nathan Kelber](http://nkelber.com) and Ted Lawless for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />\n",
    "**For questions/comments/improvements, email nathan.kelber@ithaka.org.**<br />\n",
    "![CC BY License Logo](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png)\n",
    "____\n",
    "# Exploring Word Frequencies\n",
    "\n",
    "**Description of methods in this notebook:**\n",
    "This [notebook](https://docs.tdm-pilot.org/key-terms/#jupyter-notebook) shows how to explore the [word frequencies](https://docs.tdm-pilot.org/key-terms/#word-frequency) of your [JSTOR](https://docs.tdm-pilot.org/key-terms/#jstor) and/or [Portico](https://docs.tdm-pilot.org/key-terms/#portico) [dataset](https://docs.tdm-pilot.org/key-terms/#dataset) using [Python](https://docs.tdm-pilot.org/key-terms/#python). The following processes are described:\n",
    "\n",
    "* Converting your [JSTOR](https://docs.tdm-pilot.org/key-terms/#jstor) and/or [Portico](https://docs.tdm-pilot.org/key-terms/#portico)[dataset](https://docs.tdm-pilot.org/key-terms/#dataset) into a Python list\n",
    "* Creating a raw word frequency count\n",
    "* Creating and modifying a [stop words list](https://docs.tdm-pilot.org/key-terms/#stop-words)\n",
    "* Cleaning up the [corpus](https://docs.tdm-pilot.org/key-terms/#corpus)\n",
    "* Create a new word frequency list focused on [content words](https://docs.tdm-pilot.org/key-terms/#content-words)\n",
    "\n",
    "**Difficulty:** Intermediate\n",
    "\n",
    "**Purpose:** Learning (Optimized for explanation over code)\n",
    "\n",
    "**Knowledge Required:** \n",
    "* [Python Basics I](./0-python-basics-1.ipynb)\n",
    "* [Python Basics II](./0-python-basics-2.ipynb)\n",
    "* [Python Basics III](./0-python-basics-3.ipynb)\n",
    "\n",
    "**Knowledge Recommended:**\n",
    "* [Exploring Metadata](https://docs.tdm-pilot.org/exploring-metadata/)\n",
    "* A familiarity with [The Natural Language Toolkit](https://docs.tdm-pilot.org/key-terms/#nltk) and [Counter objects](https://docs.tdm-pilot.org/key-terms/#python-counter) is helpful\n",
    "\n",
    "**Completion time:** 60 minutes\n",
    "\n",
    "**Data Format:** [JSTOR](https://docs.tdm-pilot.org/key-terms/#jstor)/[Portico](https://docs.tdm-pilot.org/key-terms/#portico) [JSON Lines (.jsonl)](https://docs.tdm-pilot.org/key-terms/#jsonl)\n",
    "\n",
    "**Libraries Used:**\n",
    "* **[json](https://docs.tdm-pilot.org/key-terms/#json-python-library)** to convert our dataset from json lines format to a Python list\n",
    "* **[NLTK](https://docs.tdm-pilot.org/key-terms/#nltk)** to help [clean](https://docs.tdm-pilot.org/key-terms/#clean-data) up our dataset\n",
    "* **Counter** from the **Collections** module to help sum up our word frequencies\n",
    "___"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import your dataset\n",
    "\n",
    "You'll use the tdm_client library to automatically upload your dataset. We import the `Dataset` module from the `tdm_client` library. The tdm_client library contains functions for connecting to the JSTOR server containing our [corpus](https://docs.tdm-pilot.org/key-terms/#corpus) [dataset](https://docs.tdm-pilot.org/key-terms/#dataset). To analyze your dataset, use the [dataset ID](https://docs.tdm-pilot.org/key-terms//#dataset-ID) provided when you created your [dataset](https://docs.tdm-pilot.org/key-terms//#dataset). A copy of your [dataset ID](https://docs.tdm-pilot.org/key-terms//#dataset-ID) was sent to your email when you created your [corpus](https://docs.tdm-pilot.org/key-terms/#corpus). It should look like a long series of characters surrounded by dashes. If you haven't created a dataset, feel free to use a sample dataset. Here's a [list by discipline](https://docs.tdm-pilot.org/sample-datasets/). Advanced users can also [upload a dataset from their local machine](https://docs.tdm-pilot.org/uploading-a-dataset/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Importing your dataset with a dataset ID\n",
    "import tdm_client\n",
    "#Load the sample dataset, the full run of Shakespeare Quarterly from 1950-2013.\n",
    "tdm_client.get_dataset(\"7e41317e-740f-e86a-4729-20dab492e925\", \"sampleJournalAnalysis\") #Insert your dataset ID on this line"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before we can begin working with our [dataset](https://docs.tdm-pilot.org/key-terms/#dataset), we need to convert the [JSON lines](https://docs.tdm-pilot.org/key-terms/#jsonl) file format into [Python](https://docs.tdm-pilot.org/key-terms/#python) so we can work with it. Remember that each line of our [JSON lines](https://docs.tdm-pilot.org/key-terms/#jsonl) file represents a single text, whether that is a journal article, book, or something else. We will create a [Python](https://docs.tdm-pilot.org/key-terms/#python) list that contains every document. Within each list item for each document, we will use a [Python dictionary](https://docs.tdm-pilot.org/key-terms/#python-dictionary) of [key/value pairs](https://docs.tdm-pilot.org/key-terms/#key-value-pair) to store information related to that document. [Read more about the dataset format](https://docs.tdm-pilot.org/what-format-are-jstor-portico-datasets/).\n",
    "\n",
    "Essentially we will have a [list](https://docs.tdm-pilot.org/key-terms/#python-list) of documents numbered, from zero to the last document. Each [list](https://docs.tdm-pilot.org/key-terms/#python-list) item then will be composed of a [dictionary](https://docs.tdm-pilot.org/key-terms/#python-dictionary) of [key/value pairs](https://docs.tdm-pilot.org/key-terms/#key-value-pair) that allows us to retrieve information from that particular document by number. The structure will look something like this:\n",
    "\n",
    "![Structure of the corpus, a list of dictionaries](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CorpusView.png)\n",
    "\n",
    "For each item in our list we will be able to use [key/value pairs](https://docs.tdm-pilot.org/key-terms/#key-value-pair) to get a **value** if we supply a **key**. We will call our [Python list](https://docs.tdm-pilot.org/key-terms/#python-list) variable `all_documents` since it will contain all of the documents in our [corpus](https://docs.tdm-pilot.org/key-terms/#corpus)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The tdm_client uses a default `file_name` of `sampleJournalAnalysis.jsonl`.\n",
    "# Unless you have uploaded your own dataset, this code cell requires no modification before running.\n",
    "file_name = 'sampleJournalAnalysis.jsonl' \n",
    "\n",
    "# Import the json module\n",
    "import json\n",
    "# Create an empty new list variable named `all_documents`\n",
    "all_documents = [] \n",
    "# Temporarily open the file `filename` in the datasets/ folder\n",
    "with open('./datasets/' + file_name) as dataset_file: \n",
    "    #for each line in the dataset file\n",
    "    for line in dataset_file: \n",
    "        # Read each line into a Python dictionary.\n",
    "        # Create a variable document that contains the line using json.loads to convert the json key/value pairs to a python dictionary\n",
    "        document = json.loads(line) \n",
    "        # Append a new list item to `all_documents` containing the dictionary we created.\n",
    "        all_documents.append(document) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now all of our documents have been converted from our original [JSON lines](https://docs.tdm-pilot.org/key-terms/#jsonl) file format (.jsonl) into a [python List](https://docs.tdm-pilot.org/key-terms/#python-list) variable named `all_documents`. Let's see what we can discover about our [corpus](https://docs.tdm-pilot.org/key-terms/#corpus) with a few simple methods.\n",
    "\n",
    "First, we can determine how many texts are in our [dataset](https://docs.tdm-pilot.org/key-terms/#dataset) by using the `len()` function to get the size of `all_documents`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "len(all_documents)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Choosing a Document\n",
    "We will create a new variable called `chosen_document` and set it equal to the first [list](https://docs.tdm-pilot.org/key-terms/#python-list) item in `all_documents`. (Remember, in Python lists, 0 is the first item, 1 is the second item, 2 is the third item, etc.)\n",
    "\n",
    "We'll also use the `.get()` method to retrieve some information about the item and print it here. This will help us check to make sure this is a suitable article. If it is front matter or back matter, for example, you may want to select another article. You can achieve that by changing the index number in the first line of code. For example, you might change `all_documents[0]` (the first article in the list) to `all_documents[5]` (the sixth article in the list).\n",
    "\n",
    "If you're looking at a JSTOR corpus document, you can also follow the URL to preview it. This will help you determine if it's a good example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "chosen_document = all_documents[0] # Create a dictionary variable that contains the first document from all_documents. Change 0 if you want another document.\n",
    "print(chosen_document.get('title')) # Get the value for the key title for `chosen_document` and print it\n",
    "print('written in ' + str(chosen_document.get('isPartOf'))) # Print 'written in' and the journal value stored in the key 'isPartOf'\n",
    "#print(str(chosen_document.get('publicationYear')) + ', Volume ' + chosen_document.get('volumeNumber')) # Print the value of the key `publicationYear` and `volumeNumber`\n",
    "print('URL is: ' + chosen_document.get('url')) # Print 'URL is: ' and the value for the key 'url' in `chosen_document`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's examine the word counts from the `chosen_document`. First, we create a new variable `word_counts` that will contain the word counts from our `chosen_document`. These are stored as [Python dictionary variables](https://docs.tdm-pilot.org/key-terms/#python-dictionary)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "word_counts = chosen_document.get('unigramCount')\n",
    "dict(list(word_counts.items())[:10]) #This code previews the first 10 items in your dictionary\n",
    "# It does this by turning the `word_counts` dictionary into a list and then shows 10 items (and then turns it back into a dictionary)\n",
    "# We could also use a for loop to show all the keys and values using the items() method in the word_counts dictionary\n",
    "#for k,v in word_counts.items():\n",
    "#    print(k + ': ' + str(v))\n",
    "    \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## An Explanation of the Counter Container Datatype\n",
    "\n",
    "In order to help analyze our dictionary, we are going to use a special container datatype called a Counter. A Counter is like a dictionary. In fact, it uses brackets `{}` like a dictionary. Here's an example where we turn a dictionary (`dictionaryDemo`) into a Counter (`counterDemo`) in order to explore the difference between the two:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter # Import Counter datatype\n",
    "dictionary_demo = {\"Random\": 23,\n",
    "                  'Words': 3,\n",
    "                  'For': 4,\n",
    "                  'The': 4,\n",
    "                  'Example': 553} # Create example dictionary with key/value pairs of words and numbers\n",
    "counter_demo = Counter(dictionary_demo) # Turn the dictionary into a counter\n",
    "print(counter_demo)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, the Counter type looks identical to a dictionary with key/value pairs within {} that is surrounded by the parentheses in `Counter()`. Both dictionaries and counters can return a **value** from a **key**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(dictionary_demo['Random']) # Using the Python dictionary `dictionary_demo`, return the value for the key 'Random'\n",
    "print(counter_demo['Random']) #Using the Python counter `counter_demo`, return the value for the key 'Random'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, the `Counter()` has some helpful differences from a dictionary. One difference is that a `Counter()` returns a 0 when no such key exists."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(counter_demo['no_such_key_exists']) # With a Counter, the value of the made-up key `no_such_key_exists` is 0. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If a key is not in a dictionary, it returns a KeyError."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#print(dictionary_demo['no_such_key_exists']) # With a dictionary, the value of the made-up key `no_such_key_exists` causes a KeyError in Python"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While this is useful, we already have a similar functionality using the `get()` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A demonstration of returning a string when no such key exists\n",
    "print(dictionary_demo.get('no_such_key_exists')) # If no key is found, `None` is returned\n",
    "print(counter_demo.get('no_such_key_exists', 'No such key')) # We can also supply a second argument that defines a string to be returned"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For our purposes, the most useful aspect of the `Counter()` datatype is that it lets us easily return the most common items through the `most_common()` method. We can specify an argument with this method to receive a certain number of results. Let's try it on our example `counter_demo`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "counter_demo.most_common(3) # Print the top 3 most common items in `counter_demo`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Using Counter to Sum the Words in a Single Article\n",
    "\n",
    "Let's return then to the dictionary we created to hold all the words in our article. We called that variable `word_counts`. We can get a preview of the first 10 words in our dictionary using the code below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dict(list(word_counts.items())[:10]) #This code previews the first 10 items in the dictionary\n",
    "# It does this by turning the `word_counts` dictionary into a list and then shows 10 items (and then turns it back into a dictionary)\n",
    "# We could also use a for loop to show all the keys and values using the items() method in the word_counts dictionary\n",
    "#for k,v in word_counts.items():\n",
    "#    print(k + ': ' + str(v))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note, the key/value pairs may not be in order from most frequent to least frequent words. We can sort by most frequent words by turning our dictionary `word_counts` into a Counter and then using the `most_common()` method. Let's call our new Counter object `counter_word_counts` and then print out the top 30 most common words. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "counter_word_counts = Counter(word_counts) # Create `counter_word_counts` that will be Counter datatype version of our original `word_counts` dictionary\n",
    "for key, value in counter_word_counts.most_common(30): # For each key/value pair in counter_word_count's top 30 most common words\n",
    "    print(key.ljust(15), value) #print the `key` left justified 15 characters from the `value` "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have successfully created a word frequency list. There are a couple small issues, however, that we still need to address:\n",
    "1. There are many [function words](https://docs.tdm-pilot.org/key-terms/#function-words), words like \"the\", \"in\", and \"of\" that are grammatically important but do not carry as much semantic meaning like [content words](https://docs.tdm-pilot.org/key-terms/#content-words), such as nouns and verbs. \n",
    "2. The words represented here are actually case-sensitive [strings](https://docs.tdm-pilot.org/key-terms/#string). That means that the string \"the\" is a different from the string \"The\". You may notice this in your results above.\n",
    "\n",
    "To solve these issues, we need to find a way to remove common [function words](https://docs.tdm-pilot.org/key-terms/#function-words) and combine [strings](https://docs.tdm-pilot.org/key-terms/#string) that may have capital letters in them. We can solve these issues by:\n",
    "\n",
    "1. Using a [stopwords](https://docs.tdm-pilot.org/key-terms/#stop-words) list to remove common [function words](https://docs.tdm-pilot.org/key-terms/#function-words)\n",
    "2. Lowercasing all the characters in each string to combine our counts\n",
    "\n",
    "We could create our own stopwords list, but luckily there are many examples out there already. We'll use NLTK's [stopwords](https://docs.tdm-pilot.org/key-terms/#stop-words) list to get started.\n",
    "\n",
    "First, we create a new list variable `stop_words` and initialize it with the common English [stopwords](https://docs.tdm-pilot.org/key-terms/#stop-words) from the [Natural Language Toolkit](https://docs.tdm-pilot.org/key-terms/#nltk) library. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Creating a stop_words list from the NLTK. We could also use the set of stopwords from Spacy or Gensim.\n",
    "from nltk.corpus import stopwords #import stopwords from nltk.corpus\n",
    "stop_words = stopwords.words('english') #create a list `stop_words` that contains the English stop words list"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you're curious what is in our stopwords list, we can print a slice of the first ten words in our list to get a preview."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stop_words[:10] #print the first 10 stop words in the list\n",
    "#list(stop_words) #show the whole stopwords list"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It may be that we want to add additional words to our stoplist. For example, we may want to remove character names. We can add items to the list by using the append method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stop_words.append(\"octopus\")\n",
    "stop_words[-10:] #evaluate and show me a slice of the last 10 items in the `stop_words` list variable"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also add multiple words to our stoplist by using the extend() method. Notice that this method requires using a set of brackets `[]` to clarify that we are adding \"gertrude\" and \"horatio\" as list items."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stop_words.extend([\"kangaroo\", \"lemur\"])\n",
    "stop_words[-10:] #evaluate and show me a slice of the last 10 items in the `stop_words` list variable"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also remove words from our list with the remove() method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stop_words.remove(\"octopus\")\n",
    "stop_words.remove(\"kangaroo\")\n",
    "stop_words.remove(\"lemur\")\n",
    "# Or to remove the last three words:\n",
    "# del stop_words[-3:]\n",
    "stop_words[-10:] #evaluate and show me a slice of the last 10 items in the `stop_words` list variable"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Storing Stopwords in a CSV File\n",
    "We could also store our stop words in a CSV file. A CSV, or \"Comma-Separated Values\" file, is a plain-text file with commas separating each entry. The file could be opened and modified with a text editor or spreadsheet software such as Excel or Google Sheets. Here's what our NLTK stopwords list will look like as a CSV file opened in a plain text editor.\n",
    "\n",
    "![The csv file as an image](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/stopwordsCSV.png)\n",
    "\n",
    "Let's create an example CSV."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a CSV file to store a set of stopwords\n",
    "\n",
    "import csv # Import the csv module to work with csv files\n",
    "outputFile = open('stop_words.csv', 'w', newline='') # Create a variable `outputFile` that will be linked to a new csv file called stop_words.csv\n",
    "outputWriter = csv.writer(outputFile) # Create a writer object to add to our `outputFile`\n",
    "outputWriter.writerow(stop_words) # Add our list `stop_words` to the CSV file\n",
    "outputFile.close() # Close the CSV file"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have created a new file called stopWords.csv that you can open to modify. Go ahead and make a change to your stopWords.csv (either adding or subtracting words). Remember, there are no spaces between words in the CSV file. If you want to edit the CSV right inside Jupyter Lab, right-click on the file and select \"Open With > Editor.\" \n",
    "\n",
    "![Selecting \"Open With > Editor\" in Jupyter Lab](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/editCSV.png)\n",
    "\n",
    "Now go ahead and add in a new word. Remember a few things:\n",
    "\n",
    "* Each word is separated from the next word by a comma.\n",
    "* There are no spaces between the words.\n",
    "* You must save changes to the file if you're using a text editor, Excel, or the Jupyter Lab editor.\n",
    "* You can reopen the file to make sure your changes were saved.\n",
    "\n",
    "Now let's read our CSV file back and overwrite our original `stop_words` list variable. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Open the CSV file and list the contents\n",
    "\n",
    "new_stopwords_file = open('stop_words.csv') # Open `stopWords.csv` as the variable newStopwordsFile\n",
    "new_stopwords_reader = csv.reader(new_stopwords_file) # Create newStopwordsReader variable to open the newStopwordsFile in Reader Mode\n",
    "stop_words = list(new_stopwords_reader)[0] # Define the stop_words variable as a list to the contents of newStopwordsReader\n",
    "stop_words[-10:] # Return the last ten items of the list stop_words"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Refining a stopwords list for your analysis can take time. It depends on:\n",
    "\n",
    "* What you are hoping to discover (for example, are function words important?)\n",
    "* The material you are analyzing (for example, journal articles may repeat words like \"abstract\")\n",
    "\n",
    "If your results are not satisfactory, you can always come back and adjust the stopwords. You may need to run your analysis many times to refine a good stopword list.\n",
    "___\n",
    "## Cleaning and Standardizing Tokens\n",
    "\n",
    "We can standardize and [clean](https://docs.tdm-pilot.org/key-terms/#clean-data) up the [tokens](https://docs.tdm-pilot.org/key-terms/#token) in our [dataset](https://docs.tdm-pilot.org/key-terms/#dataset) by creating a function that passes each token through a series of tests. The function will:\n",
    "* discard [tokens](https://docs.tdm-pilot.org/key-terms/#token) less than 4 characters in length\n",
    "* discard [tokens](https://docs.tdm-pilot.org/key-terms/#token) with non-alphabetical characters\n",
    "* lowercase all characters in each [token](https://docs.tdm-pilot.org/key-terms/#token)\n",
    "* remove [stopwords](https://docs.tdm-pilot.org/key-terms/#stop-words) based on the list we created in `stop_words`\n",
    "\n",
    "Of course, depending on your analysis and goals, you may want to change one or more the tests."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A series of tests to see whether a token should be added to our final word count.\n",
    "# In order for a token to be added, it must pass all these tests.\n",
    "\n",
    "cleaned_word_counts = Counter() # define a new variable `cleaned_word_counts` that is an empty counter type. We will store our cleaned data in it.\n",
    "\n",
    "for token, count in counter_word_counts.items(): # For each key (`token`), value (`count`) pair in our cleaned_word_counts Counter, run the following tests...\n",
    "    if len(token) < 4: # If the token is less than four characters, restart the loop with the next token\n",
    "        continue\n",
    "    if not token.isalpha(): # If the token contains characters that are not from the alphabet, restart the loop over with the next token\n",
    "        continue\n",
    "    t = token.lower() # Define a variable `t` that is an all-lowercase version of the token\n",
    "    if t in stop_words: # If the token `t` is in our stop_words list, restart the loop over with the next token\n",
    "        continue\n",
    "    cleaned_word_counts[t] += count # Add `t` and `count` to `cleaned_word_counts`\n",
    "    \n",
    "print(cleaned_word_counts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The resulting dictionary `clean_word_counts` contains only function words, lowercased, and greater than four characters. We can now print the top 25 most common words using the `most_common()` method for Counters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for key, value in cleaned_word_counts.most_common(25): # For the top 25 most common key/value pairs in `cleaned_word_counts`\n",
    "    print(key.ljust(15), value) # print the key (left-justified by 15 characters) followed by the value\n",
    "    # Remember that the key above corresponds to the token and the value corresponds to the number of times that token occurs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "___\n",
    "# Start Next Lesson: [Finding Significant Terms using TF/IDF](./1-significant-terms.ipynb)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}