{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "Created by [Nathan Kelber](http://nkelber.com) and Ted Lawless for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)
\n", "For questions/comments/improvements, email nathan.kelber@ithaka.org.
\n", "___\n", "**Creating a Stopwords List**\n", "\n", "**Description:**\n", "This [notebook](https://docs.tdm-pilot.org/key-terms/#jupyter-notebook) explains what a stopwords list is and how to create one. The following processes are described:\n", "\n", "* Loading the NLTK stopwords list\n", "* Modifying the stopwords list in Python\n", "* Saving a stopwords list to a .csv file\n", "* Loading a stopwords list from a .csv file\n", "\n", "**Use Case:** For Learners (Detailed explanation, not ideal for researchers)\n", "\n", "**Difficulty:** Intermediate\n", "\n", "**Completion time:** 20 minutes\n", "\n", "**Knowledge Required:** \n", "* Python Basics Series ([Start Python Basics I](./python-basics-1.ipynb))\n", "\n", "**Knowledge Recommended:** None\n", "\n", "**Data Format:** CSV files\n", "\n", "**Libraries Used:**\n", "* **[nltk](https://docs.tdm-pilot.org/key-terms/#nltk)** to create an initial stopwords list\n", "* **csv** to read and write the stopwords to a file\n", "\n", "**Research Pipeline:** None\n", "___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The Purpose of a Stopwords List\n", "\n", "Many text analytics techniques are based on counting the occurrence of words in a given text or set of texts (called a corpus). 
The most frequent words can reveal general textual patterns, but for almost any English text they tend to look very similar to this:\n", "\n", "\n", "|Word|Frequency|\n", "|---|---|\n", "|the|1,160,276|\n", "|of|906,898|\n", "|and|682,419|\n", "|in|461,328|\n", "|to|418,017|\n", "|a|334,082|\n", "|is|214,663|\n", "|that|204,277|\n", "|by|181,605|\n", "|as|177,774|\n", "\n", "There are many [function words](https://docs.tdm-pilot.org/key-terms/#function-words), words like \"the\", \"in\", and \"of\" that are grammatically important but carry less semantic meaning than [content words](https://docs.tdm-pilot.org/key-terms/#content-words), such as nouns and verbs.\n", "\n", "For this reason, many analysts remove common [function words](https://docs.tdm-pilot.org/key-terms/#function-words) using a [stopwords](https://docs.tdm-pilot.org/key-terms/#stop-words) list. There are many sources for stopwords lists. (We'll use the Natural Language Toolkit stopwords list in this lesson.) 
**There is no official, standardized stopwords list for text analysis.**\n", "\n", "An effective stopwords list depends on:\n", "\n", "* the texts being analyzed\n", "* the purpose of the analysis\n", "\n", "Even if we remove all common function words, there are often formulaic repetitions in texts that may be counter-productive for the research goal. **The researcher is responsible for making educated decisions about whether or not to include any particular stopword given the research context.**\n", "\n", "Here are a few examples where additional stopwords may be necessary:\n", "\n", "* A corpus of law books is likely to have formulaic, archaic repetition, such as \"hereunto this law is enacted...\"\n", "* A corpus of dramatic plays is likely to have speech markers for each line, leading to an over-representation of character names (Hamlet, Gertrude, Laertes, etc.)\n", "* A corpus of emails is likely to have header language (to, from, cc, bcc), technical language (attached, copied, thread, chain), and salutations (best, dear, cheers, etc.)\n", "\n", "Because every research project may require unique stopwords, it is important for researchers to learn to create and modify stopwords lists." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Examining the NLTK Stopwords List\n", "\n", "The Natural Language Toolkit stopwords list is well known and a natural starting point for creating your own list. Let's take a look at what it contains before learning to make our own modifications.\n", "\n", "We will store our stopwords in a Python list variable called `stop_words`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Creating a stop_words list from the NLTK. 
We could also use the set of stopwords from spaCy or Gensim.\n", "import nltk # Import nltk so we can download its data\n", "nltk.download('stopwords') # Download the stopwords corpus if it is not already installed (assumes network access)\n", "from nltk.corpus import stopwords # Import stopwords from nltk.corpus\n", "stop_words = stopwords.words('english') # Create a list `stop_words` that contains the English stopwords list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you're curious what is in our stopwords list, we can use the `print()` or `list()` functions to find out." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "list(stop_words) # Show each string in our stopwords list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Storing Stopwords in a CSV File\n", "Storing the stopwords list in a variable like `stop_words` is useful for analysis, but we will likely want to keep the list after the session is over for future changes and analyses. We can store our stopwords list in a CSV file. A CSV, or \"Comma-Separated Values\" file, is a plain-text file with commas separating each entry. The file can be opened and modified with a text editor or spreadsheet software such as Excel or Google Sheets.\n", "\n", "Here's what our NLTK stopwords list will look like as a CSV file opened in a plain text editor.\n", "\n", "![The csv file as an image](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/stopwordsCSV.png)\n", "\n", "Let's create an example CSV using the `csv` module." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a CSV file to store a set of stopwords\n", "\n", "import csv # Import the csv module to work with CSV files\n", "import os # Import os to create the output folder if needed\n", "os.makedirs('data', exist_ok=True) # Make sure the data folder exists before writing\n", "with open('data/stop_words.csv', 'w', newline='') as f:\n", "    writer = csv.writer(f)\n", "    writer.writerow(stop_words)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have created a new file called `data/stop_words.csv` that you can open and modify using a basic text editor. 
Go ahead and make a change to `data/stop_words.csv` (adding or removing words) using a text editor. If you want to edit the CSV right inside Jupyter Lab, right-click on the file and select \"Open With > Editor.\"\n", "\n", "![Selecting \"Open With > Editor\" in Jupyter Lab](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/editCSV.png)\n", "\n", "Now add a new word, keeping a few things in mind:\n", "\n", "* Each word is separated from the next word by a comma.\n", "* There are no spaces between the words.\n", "* You must save changes to the file if you're using a text editor, Excel, or the Jupyter Lab editor.\n", "* You can reopen the file to make sure your changes were saved.\n", "\n", "Now let's read our CSV file back and overwrite our original `stop_words` list variable." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Reading in a Stopwords CSV" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Open the CSV file and show the last ten stopwords\n", "\n", "with open('data/stop_words.csv', 'r') as f:\n", "    stop_words = f.read().strip().split(\",\")\n", "stop_words[-10:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Refining a stopwords list for your analysis can take time. It depends on:\n", "\n", "* What you are hoping to discover (for example, are function words important?)\n", "* The material you are analyzing (for example, journal articles may repeat words like \"abstract\")\n", "\n", "If your results are not satisfactory, you can always come back and adjust the stopwords. 
You may need to run your analysis many times to arrive at a good stopwords list.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": null }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 4 }