{ "cells": [ { "cell_type": "markdown", "id": "50463b40", "metadata": {}, "source": [ "
\n", "\n", "Created by [Nathan Kelber](http://nkelber.com) for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)
\n", "For questions/comments/improvements, email nathan.kelber@ithaka.org.
\n", "____" ] }, { "cell_type": "markdown", "id": "96f72056", "metadata": {}, "source": [ "\n", "# Tokenizers\n", "\n", "**Description:**\n", "This notebook focuses on the basic concepts surrounding tokenization. It includes material on the following concepts:\n", "\n", "* Word segmentation\n", "* n-grams\n", "* Stemming\n", "* Lemmatization\n", "* Tokenizers\n", "\n", "**Use Case:** For Learners (Detailed explanation, not ideal for researchers)\n", "\n", "**Difficulty:** Intermediate\n", "\n", "**Completion time:** 90 minutes\n", "\n", "**Knowledge Required:** \n", "* Python Basics ([Start Python Basics 1](../Python-basics/python-basics-1.ipynb))\n", "\n", "**Knowledge Recommended:** \n", "* [Python Intermediate 2](../Python-intermediate/python-intermediate-2.ipynb)\n", "\n", "**Data Format:** None\n", "\n", "**Libraries Used:**\n", "* urllib.request\n", "* NLTK\n", "* spaCy\n", "\n", "**Research Pipeline:**\n", "\n", "1. Scan documents\n", "2. OCR files\n", "3. Clean up texts\n", "4. **Tokenize text files** (this notebook)\n", "___" ] }, { "cell_type": "markdown", "id": "e7737942", "metadata": {}, "source": [ "## What is a word?\n", "\n", "The concept of a word makes intuitive sense in everyday language, but it starts to break down significantly when we begin trying to formalize it for analysis with computer programs. Linguists have spent decades creating formal rules for breaking down texts into smaller parts for analysis, dealing in great detail with the normally unspoken rules of grammar. In this lesson, we consider what a word is and consider how we could write a program for collecting the words within a text.\n", "\n", "Let's take a look at an example sentence:\n", "\n", "> Now that summer's here, we're going to visit the beach at Lake Michigan and eat ice cream.\n", "\n", "How many words are in this sentence? We could start by simply looking at words that are separated by spaces. \n", "\n", "> Now, that, summer's, here, we're, going, to, visit, the, beach, at, Lake, Michigan, and, eat, ice, cream.\n", "\n", "That would give us 17 words. But we could ask a few questions about this count. For example, is 'Lake Michigan' one word or two words? Certainly, lake and Michigan have their own individual meanings, but Lake Michigan certainly has a different meaning from either of those words individually. Similarly, what about 'ice cream'?\n", "\n", "What about contractions? Is 'we're' a single word or two words: 'we' and 'are'? If our goal is to count how many times a given word occurs in the sentence, does 'we' occur in the sentence? Does the word 'summer' occur in our sentence?\n", "\n", "Verb conjugations pose yet another problem. Should the word 'going' be counted separately from 'go'. What about 'went'? From a computational linguistics perspective, we could 'stem' words, simply lopping off the 'ing' from 'going' to get 'go'. But that would poses some serious programming challenges for words like 'running' where the base form is 'run' instead of 'runn'. And we might run into issues with words 'sing' or 'singing' that should not have 'ing' removed in the former case but once in the later case. How could we distinguish between words that are conjugated, like'sings', and words that are plural like 'wings'. Sometimes an -s ending is plural (fens) and other times it is not (lens).\n", "\n", "## Tokenization\n", "\n", "Tokenization, or segmenting a text into word chunks, is the first part of a Natural Language Processing pipeline. Tokens can be sentences, words, or sub-word chunks. 
The tokenization process involves many practical decisions, and this has led to many different methods, which are reflected in the variety of available tokenizers. A tokenizer takes a text as input and automatically generates tokens as output.\n", "\n", "In the case of tokenizing words, this is traditionally done by splitting on whitespace and punctuation. (There are more advanced tokenization methods for language models such as BERT and GPT. These include Byte-Pair Encoding, WordPiece, and SentencePiece.) We will look at a few examples of traditional tokenizers with a goal of gathering tokens into one-, two-, and three-word constructions. The general name for these is n-grams.\n", "\n", "An n-gram is a sequence of n items from a given sample of text or speech. Most often, this refers to a sequence of words, but it can also be used to analyze text at the level of syllables, letters, or phonemes. N-grams are often described by their length. For example, word n-grams might include:\n", "\n", "* stock (a 1-gram, or unigram)\n", "* vegetable stock (a 2-gram, or bigram)\n", "* homemade vegetable stock (a 3-gram, or trigram)\n", "\n", "A text analysis approach that looks only at unigrams would not be able to differentiate between the \"stock\" in \"stock market\" and \"vegetable stock.\" By including bigrams and trigrams in our analysis, we are able to look at concepts that extend across multiple words. One of the most popular examples of text analysis with n-grams is the [Google N-Gram Viewer](https://books.google.com/ngrams)." ] }, { "cell_type": "markdown", "id": "2515b657", "metadata": {}, "source": [ "## Constellate Datasets\n", "\n", "The Constellate [dataset builder](https://constellate.org/builder) has a historical term frequency viewer that is similar to the Google N-Gram Viewer. For example, we could create a dataset of medical journals and see how common particular terms are over time.\n", "\n", "![The Constellate Term Frequency Viewer showing diseases represented in medical journals in the 20th century](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/frequency-viewer.png)\n", "\n", "The Constellate term frequency viewer will graph frequencies for bigrams and trigrams as well.\n", "\n", "![The Constellate Term Frequency Viewer showing the frequency of different kinds of fevers whose names are bigrams](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/frequency-viewer-2.png)\n", "\n", "Building a dataset triggers a process that gathers up all the unigrams, bigrams, and trigrams for the documents you've selected. We are able to supply these n-gram lists with their accompanying metadata for any source, even if the materials are under copyright. This is the essence of a \"non-consumptive\" dataset: the researcher can access the n-grams but not the underlying full text. In cases where there are no copyright restrictions, we also supply the full text of the material.\n", "\n", "The materials are available for download and analysis in [several dataset types](https://constellate.org/docs/what-format-are-jstor-portico-datasets). The most complete type is a JSON-Lines file, which contains all of the data we can legally provide. Many of the notebooks we offer rely on this [data format](https://constellate.org/docs/what-format-are-jstor-portico-datasets) and make it easy to accomplish common text analysis tasks such as counting word frequencies, creating word clouds, weighting significant terms, and topic modeling.
\n", "\n", "We can create our own Constellate-compatible datasets from any texts by extracting the unigrams, bigrams, trigrams, and full text. We would then simply need to put them into the appropriate form matching the Constellate data schema. Then we could run the analyses mentioned above on our own texts. This notebook focuses on the tokenization processes to gather the unigrams, bigrams, and trigrams." ] }, { "cell_type": "markdown", "id": "7c0303d2", "metadata": {}, "source": [ "## Creating your own basic tokenizer\n", "\n", "It is possible to create your own basic tokenizer by using Python string methods. The following example uses the `.split()` method to gather unigrams." ] }, { "cell_type": "code", "execution_count": null, "id": "e3b92c78", "metadata": {}, "outputs": [], "source": [ "# Download Shakespeare's Othello from Project Gutenberg\n", "import urllib.request\n", "from pathlib import Path\n", "\n", "# Check if a data folder exists. If not, create it.\n", "data_folder = Path('./data/')\n", "data_folder.mkdir(exist_ok=True)\n", "\n", "text_address = 'https://www.gutenberg.org/cache/epub/1531/pg1531.txt'\n", "text_name = './data/' + text_address.rsplit('/', 1)[-1]\n", "urllib.request.urlretrieve(text_address, text_name)" ] }, { "cell_type": "code", "execution_count": null, "id": "ee17ae0a", "metadata": {}, "outputs": [], "source": [ "# Opening a file in read mode\n", "with open(text_name, 'r') as f:\n", " text = f.read()\n", " print(text)" ] }, { "cell_type": "code", "execution_count": null, "id": "9a00864c", "metadata": {}, "outputs": [], "source": [ "# See the raw string version of our text\n", "text" ] }, { "cell_type": "code", "execution_count": null, "id": "292a965f", "metadata": {}, "outputs": [], "source": [ "# Splitting the text string into a list of strings\n", "tokenized_list = text.split()\n", "list(tokenized_list)" ] }, { "cell_type": "code", "execution_count": null, "id": "2a168363", "metadata": {}, "outputs": [], "source": [ "# Cleaning up the tokens\n", "unigrams = []\n", "\n", "for token in tokenized_list:\n", " token = token.lower() # lowercase tokens\n", " token = token.replace('.', '') # remove periods\n", " token = token.replace('!', '') # remove exclamation points\n", " token = token.replace('?', '') # remove question marks\n", " unigrams.append(token)" ] }, { "cell_type": "code", "execution_count": null, "id": "cb63d832", "metadata": {}, "outputs": [], "source": [ "# Preview the unigrams\n", "list(unigrams)" ] }, { "cell_type": "code", "execution_count": null, "id": "2b89861e", "metadata": {}, "outputs": [], "source": [ "# Count up the tokens using a Counter() object\n", "from collections import Counter\n", "word_counts = Counter(unigrams)\n", "print(word_counts)" ] }, { "cell_type": "markdown", "id": "c5998030", "metadata": {}, "source": [ "## NLTK\n", "\n", "While writing your own tokenizer may allow you to create highly customized results, it is easier and more often more effective to use existing tokenizers offered in packages such as the Natural Language Toolkit (NLTK) and spaCy. Ultimately, whatever tokenizer you use, it is helpful to understand Python string manipulations and regular expressions in case you wish to adapt a particular tokenizer to your texts. 
\n", "\n", "\n", "The NLTK library has multiple tokenizers available.\n", "\n", "### [Word Punctuation](https://www.nltk.org/_modules/nltk/tokenize/punkt.html)\n", "The word punctuation tokenizer splits on white spaces and splits out punctuation into separate tokens.\n", "\n", "### [Penn Treebank](https://www.nltk.org/_modules/nltk/tokenize/treebank.html)\n", "The Tree Bank tokenizer is the default tokenizer for NLTK. It features a variety of regular expressions for addressing punctuation such as contractions, quotes, parentheses, brackets, and dashes.\n", "\n", "### [Tweet](https://www.nltk.org/_modules/nltk/tokenize/casual.html#TweetTokenizer)\n", "The Twitter tokenizer is designed to work with Twitter and social media text. It uses regular expressions for addressing emoticons, phone numbers, URLs, Twitter usernames, and email addresses.\n", "\n", "### [Multi-Word Expression](https://www.nltk.org/_modules/nltk/tokenize/mwe.html)\n", "The MWETokenizer takes a \"string which has already been divided into tokens and retokenizes it, merging multi-word expressions into single tokens, using a lexicon of MWEs.\" The lexicon of Multi-Word Entities is constructed by the user. It can be constructed ad-hoc depended on the user's research interest or discovered through the use of techniques like part of speech tagging, collocation, and named entity recognition." ] }, { "cell_type": "code", "execution_count": null, "id": "e7bd7206", "metadata": {}, "outputs": [], "source": [ "# Import a variety of tokenizers\n", "import nltk\n", "nltk.download('punkt', download_dir='./data/nltk_data')\n", "nltk.download('averaged_perceptron_tagger', download_dir='./data/nltk_data')\n", "from nltk.tokenize import (TreebankWordTokenizer,\n", " word_tokenize,\n", " wordpunct_tokenize,\n", " TweetTokenizer,\n", " MWETokenizer)" ] }, { "cell_type": "code", "execution_count": null, "id": "14904c11", "metadata": {}, "outputs": [], "source": [ "string = \"Nathan Kelber is helping us tokenize with the Constellate platform. http://constellate.org #NLP\"" ] }, { "cell_type": "code", "execution_count": null, "id": "c2ccb5bc", "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Python .split() tokenization\n", "split_tokens = string.split()\n", "print('Python .split()')\n", "print(split_tokens, '\\n')\n", "\n", "# Punctuation-based tokenization\n", "punct_tokens = wordpunct_tokenize(string)\n", "print('Wordpunct tokenizer')\n", "print(punct_tokens, '\\n')\n", "\n", "# Treebank Tokenizer\n", "treebank_tokens = TreebankWordTokenizer().tokenize(string)\n", "print('Treebank Tokenizer')\n", "print(treebank_tokens, '\\n')\n", "\n", "# TweetTokenizer\n", "tweet_tokens = TweetTokenizer().tokenize(string)\n", "print('Tweet Tokenizer')\n", "print(tweet_tokens, '\\n')\n", "\n", "# Multi-Word Expression Tokenizer\n", "tokenizer = MWETokenizer([('Nathan', 'Kelber')])\n", "MWE_tokens = tokenizer.tokenize(word_tokenize(string))\n", "print('MWE Tokenizer')\n", "print(MWE_tokens)" ] }, { "cell_type": "markdown", "id": "0d1e81ad", "metadata": {}, "source": [ "The tokenizer will generate a list of unigrams, but we still need to generate our bigrams and trigrams. We can simply pass the tokens into NLTK's bigrams and trigrams methods then store the results in a list." 
] }, { "cell_type": "code", "execution_count": null, "id": "fd81f814", "metadata": {}, "outputs": [], "source": [ "# Creating our bigrams and trigrams\n", "bigrams = list(nltk.bigrams(treebank_tokens))\n", "trigrams = list(nltk.trigrams(treebank_tokens))\n", "\n", "print('Bigrams: \\n ', bigrams, '\\n')\n", " \n", "print('Trigrams: \\n,', trigrams)\n" ] }, { "cell_type": "markdown", "id": "64d9d630", "metadata": {}, "source": [ "The NLTK bigrams and trigrams method creates a list of bigrams that are tuples. If we want them to be strings, then we would need to access each index of the tuple and create a string out of it." ] }, { "cell_type": "code", "execution_count": null, "id": "4e63a282", "metadata": {}, "outputs": [], "source": [ "# Function definitions for Converting NLTK tuples into strings\n", "\n", "from collections import Counter\n", "\n", "def convert_tuple_bigrams(tuples_to_convert):\n", " \"\"\"Converts NLTK tuples into bigram strings\"\"\"\n", " string_grams = []\n", " for tuple_grams in tuples_to_convert:\n", " first_word = tuple_grams[0]\n", " second_word = tuple_grams[1]\n", " gram_string = f'{first_word} {second_word}'\n", " string_grams.append(gram_string)\n", " return string_grams\n", "\n", "def convert_tuple_trigrams(tuples_to_convert):\n", " \"\"\"Converts NLTK tuples into trigram strings\"\"\"\n", " string_grams = []\n", " for tuple_grams in tuples_to_convert:\n", " first_word = tuple_grams[0]\n", " second_word = tuple_grams[1]\n", " third_word = tuple_grams[2]\n", " gram_string = f'{first_word} {second_word} {third_word}'\n", " string_grams.append(gram_string)\n", " return string_grams\n", "\n", "def convert_strings_to_counts(string_grams):\n", " \"\"\"Converts a Counter of n-grams into a dictionary\"\"\"\n", " counter_of_grams = Counter(string_grams)\n", " dict_of_grams = dict(counter_of_grams)\n", " return dict_of_grams" ] }, { "cell_type": "code", "execution_count": null, "id": "0a055f9d", "metadata": {}, "outputs": [], "source": [ "# Converting the tuples\n", "string_bigrams = convert_tuple_bigrams(bigrams)\n", "bigramCount = convert_strings_to_counts(string_bigrams)\n", "\n", "print('Bigrams as a dictionary of counts')\n", "print(bigramCount, '\\n')\n", "\n", "string_trigrams = convert_tuple_trigrams(trigrams)\n", "trigramCount = convert_strings_to_counts(string_trigrams)\n", "\n", "print('Trigrams as a dictionary of counts')\n", "print(trigramCount)" ] }, { "cell_type": "markdown", "id": "5ca9dfd6", "metadata": {}, "source": [ "Depending on the analysis we are doing, we may want to group similar words together. For example, we may want to group plural words together and verb tenses.\n", "\n", "* ducks -> duck\n", "* flown -> fly\n", "\n", "To accomplish this, we could use a stemmer, such as the Snowball stemmer. A stemmer removes the last part of particular words to get a base form. It is a quick method which is useful for very large datasets and/or working with limited computing power.\n", "\n", "In an ideal world, a lemmatizer will do a better job. It does not simply strip off letters but looks up verb tenses and takes into account the part of speech of each word." 
] }, { "cell_type": "code", "execution_count": null, "id": "783f4743", "metadata": {}, "outputs": [], "source": [ "# Snowball stemmer\n", "from nltk.stem.snowball import SnowballStemmer\n", "stemmer = SnowballStemmer('english')\n", "unstemmed_token = 'running'\n", "#unstemmed_token = 'flown'\n", "\n", "stemmed_token = stemmer.stem(unstemmed_token)\n", "\n", "print(stemmed_token)" ] }, { "cell_type": "markdown", "id": "555b628a", "metadata": {}, "source": [ "Part of Speech tagging allows us to see the parts of speech of various tokens." ] }, { "cell_type": "code", "execution_count": null, "id": "5879cc2d", "metadata": {}, "outputs": [], "source": [ "# Part of Speech Tagging\n", "pos_list = nltk.pos_tag(nltk.word_tokenize(string))\n", "print(pos_list)" ] }, { "cell_type": "markdown", "id": "338f251e", "metadata": {}, "source": [ "## spaCy\n", "\n", "spaCy takes a different approach from NLTK, creating a document model of a text. It is more sophisticated, but uses a different syntax for NLP tasks.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "d78da23a", "metadata": {}, "outputs": [], "source": [ "# Install the spaCy Program\n", "!pip install spacy\n", "!pip install -U pip setuptools wheel\n", "!pin install -U spacy\n", "!python -m spacy download en_core_web_sm" ] }, { "cell_type": "code", "execution_count": null, "id": "ef85c18a", "metadata": { "scrolled": true }, "outputs": [], "source": [ "from spacy.lang.en import English\n", "\n", "nlp = English()\n", "\n", "string = \"Nathan Kelber is helping us tokenize with the Constellate platform. http://constellate.org #NLP\"\n", "\n", "my_doc = nlp(string)\n", "\n", "tokens = []\n", "for token in my_doc:\n", " tokens.append(token.text)\n", "\n", "print(tokens)" ] }, { "cell_type": "markdown", "id": "ed68a265", "metadata": {}, "source": [ "In order to change tokenization with spaCy, you can [add rules](https://machinelearningknowledge.ai/complete-guide-to-spacy-tokenizer-with-examples/). spaCy also supports Parts of Speech tagging and lemmatization." ] }, { "cell_type": "code", "execution_count": null, "id": "c9d96ff3", "metadata": {}, "outputs": [], "source": [ "import spacy\n", "nlp = spacy.load('en_core_web_sm')\n", "my_doc = nlp(string)\n", "\n", "print('Parts of Speech')\n", "for token in my_doc:\n", " print(token, token.pos_,)\n", "\n", "print('\\nLemmatizations')\n", "for token in my_doc:\n", " print(token, token.lemma_)" ] }, { "cell_type": "markdown", "id": "376d2ac9", "metadata": {}, "source": [ "We can gather our n-grams by defining a function that accepts our tokens and an argument `n` for the \"n\" in \"n-gram.\" So, a bigram would be n = 2." 
] }, { "cell_type": "code", "execution_count": null, "id": "d4374a5e", "metadata": {}, "outputs": [], "source": [ "# A function for gathering n-grams with spaCy\n", "def n_grams(tokens, n):\n", " n_grams = []\n", " for i in range(len(tokens)-n+1):\n", " n_grams.append(tokens[i:i+n])\n", " return(n_grams)\n", " # return[tokens[i:i+n] for i in range(len(tokens)-n+1)] # Written as a list comprehension" ] }, { "cell_type": "code", "execution_count": null, "id": "e492a401", "metadata": {}, "outputs": [], "source": [ "bigrams = n_grams(tokens, 2)\n", "trigrams = n_grams(tokens, 3)\n", "print(bigrams)\n", "print(trigrams)" ] }, { "cell_type": "markdown", "id": "78f5265f", "metadata": {}, "source": [ "While NLTK and spaCy tokenizers are the most prominent, there are also tokenizers available for packages such as:\n", "\n", "* [Gensim](https://radimrehurek.com/gensim/)\n", "* [Keras](https://keras.io/)\n", "* [Stanford NLP](https://nlp.stanford.edu/software/tokenizer.shtml)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }