{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Natural language processing concepts with spaCy\n", "\n", "By [Allison Parrish](http://www.decontextualize.com/)\n", "\n", "“Natural Language Processing” is a field at the intersection of computer science, linguistics and artificial intelligence which aims to make the underlying structure of language available to computer programs for analysis and manipulation. It’s a vast and vibrant field with a long history! New research and techniques are being developed constantly.\n", "\n", "The aim of this notebook is to introduce a few simple concepts and techniques from NLP—just the stuff that’ll help you do creative things quickly, and maybe open the door for you to understand more sophisticated NLP concepts that you might encounter elsewhere. We'll start with simple extraction tasks: isolating words, sentences, and parts of speech. By the end, we'll have a few working systems for creating sophisticated text generators that function by remixing texts based on their constituent linguistic units. This tutorial is written for Python 3.6+.\n", "\n", "There are a number of libraries for performing natural language processing tasks in Python, including:\n", "\n", "* [NLTK](http://www.nltk.org/): a workhorse of a library, still widely used, but somewhat dated in terms of capabilities\n", "* [Stanza](https://stanfordnlp.github.io/stanza/index.html) is a newish natural language processing library for Python, developed by the storied [Stanford NLP Group](https://nlp.stanford.edu/). It includes a client library for Stanford NLP's [CoreNLP library](https://stanfordnlp.github.io/CoreNLP/) for functionality not yet natively available in Python.\n", "* [AllenNLP](https://allennlp.org/) is a library for developing and deploying natural language processing machine learning models, including models for [sentiment analysis](https://demo.allennlp.org/sentiment-analysis), [constituency parsing](https://demo.allennlp.org/constituency-parsing) and [co-reference resolution](https://demo.allennlp.org/coreference-resolution).\n", "\n", "But we'll be using a library called [spaCy](https://spacy.io/), which very powerful and easy for newcomers to understand. It's been among the most important tools in my text processing toolbox for many years!\n", "\n", "\n", "## Natural language\n", "\n", "“Natural language” is a loaded phrase: what makes one stretch of language “natural” while another stretch is not? NLP techniques are opinionated about what language is and how it works; as a consequence, you’ll sometimes find yourself having to conceptualize your text with uncomfortable abstractions in order to make it work with NLP. (This is especially true of poetry, which almost by definition breaks most “conventional” definitions of how language behaves and how it’s structured.)\n", "\n", "Of course, a computer can never really fully “understand” human language. Even when the text you’re using fits the abstractions of NLP perfectly, the results of NLP analysis are always going to be at least a little bit inaccurate. But often even inaccurate results can be “good enough”—and in any case, inaccurate output from NLP procedures can be an excellent source of the sublime and absurd juxtapositions that we (as poets) are constantly in search of.\n", "\n", "## Language support\n", "\n", "Historically, most NLP researchers have focused their efforts on English specifically. But many natural language processing libraries now support a wide range of languages. 
You can find the full [list of supported languages](https://spacy.io/usage/models#languages) on their website, though the robustness of these models varies from one language to the next, as do the specifics of how the model works. (For example, different languages have different ideas about what a \"part of speech\" is.) The examples in this notebook are primarily in English. If you're having trouble applying these techniques to other languages, send me an e-mail—I'd be happy to help you figure out how to get things working for languages other than English!\n", "\n", "## English grammar: a crash course\n", "\n", "The only thing I believe about English grammar is [this](http://www.writing.upenn.edu/~afilreis/88v/creeley-on-sentence.html):\n", "\n", "> \"Oh yes, the sentence,\" Creeley once told the critic Burton Hatlen, \"that's\n", "> what we call it when we put someone in jail.\"\n", "\n", "There is no such thing as a sentence, or a phrase, or a part of speech, or even\n", "a \"word\"---these are all pareidolic fantasies occasioned by glints of sunlight\n", "we see reflected on the surface of the ocean of language; fantasies that we\n", "comfort ourselves with when faced with language's infinite and unknowable\n", "variability.\n", "\n", "Regardless, we may find it occasionally helpful to think about language using\n", "these abstractions. The following is a gross oversimplification of both how\n", "English grammar works, and how theories of English grammar work in the context\n", "of NLP. But it should be enough to get us going!\n", "\n", "### Sentences and parts of speech\n", "\n", "English texts can roughly be divided into \"sentences.\" Sentences are themselves\n", "composed of individual words, each of which has a function in expressing the\n", "meaning of the sentence. The function of a word in a sentence is called its\n", "\"part of speech\"—i.e., a word functions as a noun, a verb, an adjective, etc.\n", "Here's a sentence, with words marked for their part of speech:\n", "\n", " I really love entrees from the new cafeteria.\n", " pronoun adverb verb noun (plural) preposition determiner adjective noun\n", "\n", "Of course, the \"part of speech\" of a word isn't a property of the word itself.\n", "We know this because a single \"word\" can function as two different parts of speech:\n", "\n", "> I love cheese.\n", "\n", "The word \"love\" here is a verb. But here:\n", "\n", "> Love is a battlefield.\n", "\n", "... it's a noun. For this reason (and others), it's difficult for computers to\n", "accurately determine the part of speech for a word in a sentence. (It's\n", "difficult sometimes even for humans to do this.) But NLP procedures do their\n", "best!\n", "\n", "### Phrases and larger syntactic structures\n", "\n", "There are several different ways of talking about larger syntactic structures in sentences. The scheme used by spaCy is called a \"dependency grammar.\" We'll talk about the details of this below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installing spaCy\n", "\n", "There are [instructions for installing spaCy](https://spacy.io/usage) on the spaCy web page. You can also install it by running the following cell in this notebook:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting package metadata (current_repodata.json): done\n", "Solving environment: done\n", "\n", "\n", "==> WARNING: A newer version of conda exists. 
<==\n", " current version: 4.10.1\n", " latest version: 4.10.3\n", "\n", "Please update conda by running\n", "\n", " $ conda update -n base -c defaults conda\n", "\n", "\n", "\n", "# All requested packages already installed.\n", "\n" ] } ], "source": [ "import sys\n", "!conda install -c conda-forge -y --prefix {sys.prefix} spacy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You'll also need to download a language model. You can download the default language model for English by running the cell below:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting en_core_web_md==2.3.1\n", " Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.3.1/en_core_web_md-2.3.1.tar.gz (50.8 MB)\n", "\u001b[K |████████████████████████████████| 50.8 MB 10.4 MB/s eta 0:00:01 |█ | 1.6 MB 5.1 MB/s eta 0:00:10 |███████████████████████ | 36.4 MB 12.6 MB/s eta 0:00:02 |███████████████████████▎ | 36.9 MB 12.6 MB/s eta 0:00:02\n", "\u001b[?25hRequirement already satisfied: spacy<2.4.0,>=2.3.0 in /Users/allison/opt/miniconda3/envs/rwet-2022/lib/python3.8/site-packages (from en_core_web_md==2.3.1) (2.3.2)\n", "Requirement already satisfied: srsly<1.1.0,>=1.0.2 in /Users/allison/opt/miniconda3/envs/rwet-2022/lib/python3.8/site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_md==2.3.1) (1.0.2)\n", "Requirement already satisfied: plac<1.2.0,>=0.9.6 in /Users/allison/opt/miniconda3/envs/rwet-2022/lib/python3.8/site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_md==2.3.1) (0.9.6)\n", "Requirement already satisfied: requests<3.0.0,>=2.13.0 in /Users/allison/opt/miniconda3/envs/rwet-2022/lib/python3.8/site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_md==2.3.1) (2.25.1)\n", "Requirement already satisfied: thinc==7.4.1 in /Users/allison/opt/miniconda3/envs/rwet-2022/lib/python3.8/site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_md==2.3.1) (7.4.1)\n", "Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /Users/allison/opt/miniconda3/envs/rwet-2022/lib/python3.8/site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_md==2.3.1) (4.61.1)\n", "Requirement already satisfied: blis<0.5.0,>=0.4.0 in /Users/allison/opt/miniconda3/envs/rwet-2022/lib/python3.8/site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_md==2.3.1) (0.4.1)\n", "Requirement already satisfied: numpy>=1.15.0 in /Users/allison/opt/miniconda3/envs/rwet-2022/lib/python3.8/site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_md==2.3.1) (1.20.2)\n", "Requirement already satisfied: wasabi<1.1.0,>=0.4.0 in /Users/allison/opt/miniconda3/envs/rwet-2022/lib/python3.8/site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_md==2.3.1) (0.8.2)\n", "Requirement already satisfied: catalogue<1.1.0,>=0.0.7 in /Users/allison/opt/miniconda3/envs/rwet-2022/lib/python3.8/site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_md==2.3.1) (1.0.0)\n", "Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /Users/allison/opt/miniconda3/envs/rwet-2022/lib/python3.8/site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_md==2.3.1) (3.0.2)\n", "Requirement already satisfied: setuptools in /Users/allison/opt/miniconda3/envs/rwet-2022/lib/python3.8/site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_md==2.3.1) (52.0.0.post20210125)\n", "Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /Users/allison/opt/miniconda3/envs/rwet-2022/lib/python3.8/site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_md==2.3.1) (2.0.3)\n", "Requirement already 
satisfied: murmurhash<1.1.0,>=0.28.0 in /Users/allison/opt/miniconda3/envs/rwet-2022/lib/python3.8/site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_md==2.3.1) (1.0.0)\n", "Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Users/allison/opt/miniconda3/envs/rwet-2022/lib/python3.8/site-packages (from requests<3.0.0,>=2.13.0->spacy<2.4.0,>=2.3.0->en_core_web_md==2.3.1) (1.26.6)\n", "Requirement already satisfied: chardet<5,>=3.0.2 in /Users/allison/opt/miniconda3/envs/rwet-2022/lib/python3.8/site-packages (from requests<3.0.0,>=2.13.0->spacy<2.4.0,>=2.3.0->en_core_web_md==2.3.1) (4.0.0)\n", "Requirement already satisfied: idna<3,>=2.5 in /Users/allison/opt/miniconda3/envs/rwet-2022/lib/python3.8/site-packages (from requests<3.0.0,>=2.13.0->spacy<2.4.0,>=2.3.0->en_core_web_md==2.3.1) (2.10)\n", "Requirement already satisfied: certifi>=2017.4.17 in /Users/allison/opt/miniconda3/envs/rwet-2022/lib/python3.8/site-packages (from requests<3.0.0,>=2.13.0->spacy<2.4.0,>=2.3.0->en_core_web_md==2.3.1) (2021.5.30)\n", "\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n", "You can now load the model via spacy.load('en_core_web_md')\n" ] } ], "source": [ "import sys\n", "!{sys.executable} -m spacy download en_core_web_md" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Replace `en_core_web_md` with the name of the model you want to install. [The spaCy documentation explains the difference between the various models](https://spacy.io/models).\n", "\n", "The language model contains machine learning models for splitting texts into sentences and words, tagging words with their parts of speech, identifying entities, and discovering the syntactic structure of sentences." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic usage\n", "\n", "Import `spacy` like any other Python module:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import spacy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a new spaCy object using `spacy.load(...)`. The name in the parentheses is the same as the name of the model you downloaded above. If you downloaded a different model, you can put its name here instead." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "nlp = spacy.load('en_core_web_md')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's more fun doing natural language processing on text that you're interested in. I recommend grabbing a something from [Project Gutenberg](http://www.gutenberg.org/). Download a plain text file and put it in the same directory as this notebook, taking care to replace the filename in the cell below with the name of the file you downloaded." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# replace \"84-0.txt\" with the name of your own text file\n", "text = open(\"84-0.txt\").read()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, use spaCy to parse it. (This might take a while, depending on the size of your text.)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "doc = nlp(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Right off the bat, the spaCy library gives us access to a number of interesting units of text:\n", "\n", "* All of the sentences (`doc.sents`)\n", "* All of the words (`doc`)\n", "* All of the \"named entities,\" like names of places, people, #brands, etc. 
(`doc.ents`)\n", "* All of the \"noun chunks,\" i.e., nouns in the text plus surrounding matter like adjectives and articles\n", "\n", "The cell below, we extract these into variables so we can play around with them a little bit." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "sentences = list(doc.sents)\n", "words = [w for w in list(doc) if w.is_alpha]\n", "noun_chunks = list(doc.noun_chunks)\n", "entities = list(doc.ents)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Counting and sampling\n", "\n", "With this information in hand, we can answer interesting questions like: how many sentences are in the text?" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3873" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(sentences)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using `random.sample()`, we can get a small, randomly-selected sample from these lists. Here are five random sentences:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "But even human sympathies were not sufficient to satisfy his eager mind.\n", "\n", "The magistrate listened to me with attention and kindness.\n", "\n", "It was, in fact, a sledge, like that we had seen before, which had drifted towards us in the night on a large fragment of ice.\n", "\n", "The blue Mediterranean appeared, and by a strange chance, I saw the fiend enter by night and hide himself in a vessel bound for the Black Sea.\n", "\n", "I myself was about to sink under the accumulation of distress when I saw your vessel riding at anchor and holding forth to me hopes of succour and life.\n", "\n" ] } ], "source": [ "import random\n", "for item in random.sample(sentences, 5):\n", " print(item.text.strip().replace(\"\\n\", \" \"))\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ten random words:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "said\n", "it\n", "after\n", "day\n", "not\n", "for\n", "the\n", "did\n", "hastened\n", "eradicated\n" ] } ], "source": [ "for item in random.sample(words, 10):\n", " print(item.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ten random noun chunks:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "men\n", "the boat\n", "the spot\n", "intervals\n", "It\n", "that passion\n", "he\n", "the best houses\n", "my tranquillity\n", "their path\n" ] } ], "source": [ "for item in random.sample(noun_chunks, 10):\n", " print(item.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ten random entities:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "United States\n", "next day\n", "M. Clerval\n", "Elizabeth\n", "England\n", "Justine\n", "M. Duvillard\n", "Shelley\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "first\n", "the next morning\n" ] } ], "source": [ "for item in random.sample(entities, 10):\n", " print(item.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### spaCy data types\n", "\n", "Note that the values that spaCy returns belong to specific spaCy data types. 
You can read more about [these data types](https://spacy.io/api) in the spaCy documentation, in particular [spans](https://spacy.io/api/span/) and [tokens](https://spacy.io/api/token). (Spans represent sequences of tokens; a sentence in spaCy is a span, and a word is a token.) If you want a list of strings instead of a list of spaCy objects, use the `.text` attribute, which works for spans and tokens alike. For example:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "sentence_strs = [item.text for item in doc.sents]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['“I continued to wind among the paths of the wood, until I came to its\\nboundary, which was skirted by a deep and rapid river, into which many\\nof the trees bent their branches, now budding with the fresh spring.\\n',\n", " 'Oh, that some encouraging\\nvoice would answer in the affirmative!',\n", " 'Not the\\nten-thousandth portion of the anguish that was mine during the\\nlingering detail of its execution. ',\n", " 'You may remember that a\\nhistory of all the voyages made for purposes of discovery composed the\\nwhole of our good Uncle Thomas’ library. ',\n", " 'Adieu, my dear Margaret. ',\n", " 'In some degree, also, they\\ndiverted my mind from the thoughts over which it had brooded for the\\nlast month. ',\n", " 'He was for ever busy, and the only check to his\\nenjoyments was my sorrowful and dejected mind. ',\n", " 'Ah! ',\n", " 'I wait',\n", " 'I was\\nnourished with high thoughts of honour and devotion.']" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random.sample(sentence_strs, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parts of speech\n", "\n", "The spaCy parser allows us to check what part of speech a word belongs to. In the cell below, we create four different lists—`nouns`, `verbs`, `adjs` and `advs`—that contain only words of the specified parts of speech. ([There's a full list of part of speech tags here](https://spacy.io/docs/usage/pos-tagging#pos-tagging-english))." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "nouns = [w for w in words if w.pos_ == \"NOUN\"]\n", "verbs = [w for w in words if w.pos_ == \"VERB\"]\n", "adjs = [w for w in words if w.pos_ == \"ADJ\"]\n", "advs = [w for w in words if w.pos_ == \"ADV\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now we can print out a random sample of any of these:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "books\n", "child\n", "difficulty\n", "deserts\n", "functions\n", "purpose\n", "desponding\n", "Foundation\n", "uncle\n", "father\n", "father\n", "enchantment\n", "cause\n", "conception\n", "beings\n", "errors\n", "town\n", "state\n", "partiality\n", "child\n" ] } ], "source": [ "for item in random.sample(nouns, 20): # change \"nouns\" to \"verbs\" or \"adjs\" or \"advs\" to sample from those lists!\n", " print(item.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Entity types\n", "\n", "The parser in spaCy not only identifies \"entities\" but also assigns them to a particular type. 
[See a full list of entity types here.](https://spacy.io/docs/usage/entity-recognition#entity-types) Using this information, the following cell builds lists of the people, locations, and times mentioned in the text:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "people = [e for e in entities if e.label_ == \"PERSON\"]\n", "locations = [e for e in entities if e.label_ == \"LOC\"]\n", "times = [e for e in entities if e.label_ == \"TIME\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And then you can print out a random sample:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "nearly two hours\n", "the night\n", "the morning\n", "the sixth hour\n", "the morning\n", "a few moments\n", "the night\n", "about eight\n", "o’clock\n", "eight o’clock\n", "midnight\n", "this hour\n", "a few sad hours\n", "that hour\n", "a few hours\n", "night\n", "an hour\n", "a few minutes\n", "this night\n", "before morning\n", "night\n" ] } ], "source": [ "for item in random.sample(times, 20): # change \"times\" to \"people\" or \"locations\" to sample those lists\n", " print(item.text.strip())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Finding the most common\n", "\n", "After we've parsed the text out into meaningful units, it might be interesting to see which examples of those units are the most common in a text.\n", "\n", "One of the most common tasks in text analysis is counting how many times things occur in a text. The easiest way to do this in Python is with the [`Counter` object, contained in the `collections` module](https://docs.python.org/3/library/collections.html#collections.Counter). Run the following cell to create a `Counter` object to count your words." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "from collections import Counter\n", "word_count = Counter([w.text for w in words])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once you've created the counter, you can check to see how many times any word occurs like so:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "15" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "word_count['heaven']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `Counter` object's `.most_common()` method gives you access to a list of tuples with words and their counts, sorted in reverse order by count:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('the', 4070),\n", " ('and', 3006),\n", " ('I', 2847),\n", " ('of', 2746),\n", " ('to', 2155),\n", " ('my', 1635),\n", " ('a', 1402),\n", " ('in', 1135),\n", " ('was', 1019),\n", " ('that', 1018)]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "word_count.most_common(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code in the following cell prints this out nicely:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "the 4070\n", "and 3006\n", "I 2847\n", "of 2746\n", "to 2155\n", "my 1635\n", "a 1402\n", "in 1135\n", "was 1019\n", "that 1018\n", "me 867\n", "with 705\n", "had 684\n", "not 576\n", "which 565\n", "but 552\n", "you 550\n", "his 502\n", "for 494\n", "as 492\n" ] } ], "source": [ "for word, count in word_count.most_common(20):\n", " print(word, count)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You'll note that the list of most frequent words here likely reflects the overall frequency of words in English. Consult my [Quick and dirty keywords](quick-and-dirty-keywords.ipynb) tutorial for some simple strategies for extracting words that are most unique to a text (rather than simply the most frequent words). You may also consider [removing stop words](https://stackabuse.com/removing-stop-words-from-strings-in-python/) from the list." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Writing to a file\n", "\n", "You might want to export lists of words or other things that you make with spaCy to a file, so that you can bring them into other Python programs (or just other programs that form a part of your workflow). One way to do this is to write each item to a single line in a text file. 
The code in the following cell does exactly this for the word list that we just created:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "with open(\"words.txt\", \"w\") as fh:\n", " fh.write(\"\\n\".join([w.text for w in words]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following cell defines a function that performs this for any list of spaCy values you pass to it:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "def save_spacy_list(filename, t):\n", " with open(filename, \"w\") as fh:\n", " fh.write(\"\\n\".join([item.text for item in t]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's how to use it:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "save_spacy_list(\"words.txt\", words)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we're working with Counter objects a bunch in this notebook, it makes sense to find a way to save these as files too. The following cell defines a function for writing data from a `Counter` object to a file. The file is in \"tab-separated values\" format, which you can open using most spreadsheet programs. The `limit` parameter controls how many rows are written. Execute it before you continue:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "def save_counter_tsv(filename, counter, limit=1000):\n", " with open(filename, \"w\") as outfile:\n", " outfile.write(\"key\\tvalue\\n\")\n", " for item, count in counter.most_common(limit):\n", " outfile.write(item.strip() + \"\\t\" + str(count) + \"\\n\") " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, run the following cell. You'll end up with a file in the same directory as this notebook called `100_common_words.tsv` that has two columns, one for the words and one for their associated counts:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "save_counter_tsv(\"100_common_words.tsv\", word_count, 100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try opening this file in Excel or Google Docs or Numbers!\n", "\n", "If you want to write the data from another `Counter` object to a file:\n", "\n", "* Change the filename to whatever you want (though you should probably keep the `.tsv` extension)\n", "* Replace `word_count` with the name of any of the other `Counter` objects we've made in this notebook\n", "* Change the number to the number of rows you want to include in your spreadsheet." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### When do things happen in this text?\n", "\n", "Here's another example. Using the `times` entities, we can make a spreadsheet of how often particular \"times\" (durations, times of day, etc.) are mentioned in the text."
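] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before writing anything to a file, it can be handy to peek at the counts directly in the notebook. The cell below is just a quick sketch along those lines: it builds a `Counter` from the `times` list we made earlier and prints the ten most common time expressions (the exact results will depend on your text):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# a quick peek at the most common \"time\" entities before saving them to a file\n", "# (time_peek is just a throwaway name for this sketch)\n", "time_peek = Counter([e.text.lower().strip() for e in times])\n", "for text, count in time_peek.most_common(10):\n", "    print(text, count)"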
] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "time_counter = Counter([e.text.lower().strip() for e in times])\n", "save_counter_tsv(\"time_count.tsv\", time_counter, 100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do the same thing, but with people:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "people_counter = Counter([e.text.lower() for e in people])\n", "save_counter_tsv(\"people_count.tsv\", people_counter, 100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More about words\n", "\n", "The list of words that we made above is actually a list of spaCy [Token](https://spacy.io/docs/api/token) objects, which have several interesting attributes. The `.text` attribute gives the text of the word (as a Python string), and the `.lemma_` attribute gives the word's \"lemma\" (explained below):" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "projects → project\n", "own → own\n", "personal → personal\n", "in → in\n", "attentions → attention\n", "remotest → remote\n", "is → be\n", "his → -PRON-\n", "lay → lie\n", "than → than\n", "full → full\n", "a → a\n" ] } ], "source": [ "for word in random.sample(words, 12):\n", " print(word.text, \"→\", word.lemma_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A word's \"lemma\" is its most \"basic\" form, the form without any morphology\n", "applied to it. \"Sing,\" \"sang,\" \"singing,\" are all different \"forms\" of the\n", "lemma *sing*. Likewise, \"octopi\" is the plural of \"octopus\"; the \"lemma\" of\n", "\"octopi\" is *octopus*.\n", "\n", "\"Lemmatizing\" a text is the process of going through the text and replacing\n", "each word with its lemma. This is often done in an attempt to reduce a text\n", "to its most \"essential\" meaning, by eliminating pesky things like verb tense\n", "and noun number.\n", "\n", "Individual sentences can also be iterated over to get a list of words in that sentence:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I\n", "\n", "\n", "hastened\n", "to\n", "return\n", "home\n", ",\n", "and\n", "Elizabeth\n", "eagerly\n", "demanded\n", "the\n", "result\n", ".\n", "\n", "\n", "\n" ] } ], "source": [ "sentence = random.choice(sentences)\n", "for word in sentence:\n", " print(word.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parts of speech\n", "\n", "Token objects are tagged with their part of speech. For whatever reason, spaCy gives you this part of speech information in two different formats. The `pos_` attribute gives the part of speech using the [universal POS tag](https://universaldependencies.org/u/pos/) system, while the `tag_` attribute gives a more specific designation, using the [Penn Treebank](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) system. (Models for different languages will use different schemes; consult the [documentation for your model](https://spacy.io/models) for more information). 
We used this attribute earlier in the notebook to extract lists of words that had particular parts of speech, but you can access the attribute in other contexts as well:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "and / CCONJ / CC\n", "the / DET / DT\n", "the / DET / DT\n", "work / NOUN / NN\n", "madman / NOUN / NN\n", "one / NUM / CD\n", "tears / NOUN / NNS\n", "they / PRON / PRP\n", "hearing / VERB / VBG\n", "of / ADP / IN\n", "with / ADP / IN\n", "nature / NOUN / NN\n", "that / DET / WDT\n", "the / DET / DT\n", "we / PRON / PRP\n", "which / DET / WDT\n", "me / PRON / PRP\n", "the / DET / DT\n", "creators / NOUN / NNS\n", "near / SCONJ / IN\n", "inanimate / ADJ / JJ\n", "my / DET / PRP$\n", "soon / ADV / RB\n", "by / ADP / IN\n" ] } ], "source": [ "for item in random.sample(words, 24):\n", " print(item.text, \"/\", item.pos_, \"/\", item.tag_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `spacy.explain()` function also gives information about what part of speech tags mean:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'verb, non-3rd person singular present'" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spacy.explain('VBP')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Specific forms with `.tag_`\n", "\n", "The `.pos_` attribute only gives us general information about the part of speech. The `.tag_` attribute allows us to be more specific about the kinds of verbs we want. For example, this code gives us only the verbs in past participle form:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "only_past = [item.text for item in doc if item.tag_ == 'VBN']" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['attached',\n", " 'spent',\n", " 'manacled',\n", " 'facilitated',\n", " 'debilitated',\n", " 'occupied',\n", " 'confessed',\n", " 'been',\n", " 'expected',\n", " 'seized',\n", " 'endowed',\n", " 'blasted']" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random.sample(only_past, 12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or only plural nouns:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "only_plural = [item.text for item in doc if item.tag_ == 'NNS']" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['plants',\n", " 'impressions',\n", " 'scents',\n", " 'muscles',\n", " 'eyes',\n", " 'purposes',\n", " 'looks',\n", " 'towns',\n", " 'causes',\n", " 'waters',\n", " 'labours',\n", " 'papers']" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random.sample(only_plural, 12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Larger syntactic units\n", "\n", "Okay, so we can get individual words and small phrases, like named entities and noun chunks. Great! But what if we want larger chunks, based on their syntactic role in the sentence? For this, we'll need to learn about how spaCy parses sentences into its syntactic components." 
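] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick preview before the theory (this is just a sketch, using a made-up example sentence): every token in a parsed document has a `.dep_` attribute (the name of its dependency relation), a `.head` attribute (the token it depends on), and a `.children` attribute (the tokens that depend on it). The sections below explain what these relations mean:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# a minimal sketch: the dependency attributes of each token in a short example sentence\n", "preview = nlp(\"Large contented bears hibernate peacefully.\")\n", "for token in preview:\n", "    print(token.text, token.dep_, token.head.text, [child.text for child in token.children])"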
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Understanding dependency grammars" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The spaCy library parses the underlying sentences using a [dependency grammar](https://en.wikipedia.org/wiki/Dependency_grammar). Dependency grammars look different from the kinds of sentence diagramming you may have done in high school, and even from tree-based [phrase structure grammars](https://en.wikipedia.org/wiki/Phrase_structure_grammar) commonly used in descriptive linguistics. The idea of a dependency grammar is that every word in a sentence is a \"dependent\" of some other word, which is that word's \"head.\" Those \"head\" words are in turn dependents of other words. The finite verb in the sentence is the ultimate \"head\" of the sentence, and is not itself dependent on any other word. The dependents of a particular head are sometimes called its \"children.\"\n", "\n", "The question of how to know what constitutes a \"head\" and a \"dependent\" is complicated. For more details, consult [Dependency Grammar and Dependency Parsing](http://stp.lingfil.uu.se/~nivre/docs/05133.pdf). But here are some simple guidelines:\n", "\n", "* A head determines the syntactic category of a phrase as a whole, and can often stand in for the phrase.\n", "* The meaning of a phrase comes primarily from its head. The dependents further specify the meaning, but don't determine it.\n", "* Heads are obligatory, while dependents are often optional.\n", "\n", "For example, in the sentence \"Large contented bears hibernate peacefully,\" *bears* is the head (a noun in this case) and *large* and *contented* are dependents (adjectives). The head of the phrase *large contented bears* is a noun, so the entire phrase is a noun. You could also rewrite the sentence to omit the dependents altogether, and it would still make sense: \"Bears hibernate peacefully.\" Likewise, the adverb *peacefully* is a dependent of the head *hibernate*; the sentence could be rewritten as simply \"Bears hibernate.\" \n", "\n", "Dependents are related to their heads by a *syntactic relation*. The name of the syntactic relation describes the relationship between the head and the dependent. Every token object in a spaCy document or sentence has attributes that tell you what the word's head is, what the dependency relationship is between that word and its head, and a list of that word's children (dependents).\n", "\n", "The developers of spaCy included a little tool for visualizing the dependency relations of a particular sentence. 
Let's look at the sentence \"I have eaten the plums that were in the icebox\" as an example:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[displaCy dependency visualization: each word labelled with its part of speech (I/PRON, have/AUX, eaten/VERB, the/DET, plums/NOUN, that/DET, were/AUX, in/ADP, the/DET, icebox/NOUN) and arcs labelled nsubj, aux, det, dobj, nsubj, relcl, prep, det, pobj connecting heads to their dependents]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "spacy.displacy.render(nlp(\"I have eaten the plums that were in the icebox.\"), style='dep')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The arcs you see originate at a head and terminate at dependents (or children). If you follow all of the arcs back from dependent to head, you'll eventually get back to *eaten*, which is the *root* of the sentence. Each arc is labelled with the dependency *relation*, which tells us what role the dependent fills in the syntax and meaning of the parent word. For example, *I* is related to *eaten* by the `nsubj` relation, which means that *I* is the \"nominal subject\" of the verb. The word *icebox* is related to the head *in* via the `pobj` relation, meaning that *icebox* is the object of the preposition *in*. An exhaustive list of the meanings of these relations can be found in the [Stanford Dependencies Manual](https://downloads.cs.stanford.edu/nlp/software/dependencies_manual.pdf).\n", "\n", "The following code prints out each word in the sentence, the tag, the word's head, the word's dependency relation with its head, and the word's children (i.e., dependent words). 
(This code isn't especially useful on its own, it's just here to help show you how this functionality works.)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original sentence: He is now much recovered from his illness and is continually on the deck, apparently watching for the sledge that preceded his own.\n", "\n", "Word: He\n", "Tag: PRP\n", "Head: recovered\n", "Dependency relation: nsubjpass\n", "Children: []\n", "\n", "Word: is\n", "Tag: VBZ\n", "Head: recovered\n", "Dependency relation: auxpass\n", "Children: []\n", "\n", "Word: now\n", "Tag: RB\n", "Head: recovered\n", "Dependency relation: advmod\n", "Children: []\n", "\n", "Word: much\n", "Tag: RB\n", "Head: recovered\n", "Dependency relation: advmod\n", "Children: []\n", "\n", "Word: recovered\n", "Tag: VBN\n", "Head: recovered\n", "Dependency relation: ROOT\n", "Children: [He, is, now, much, from, and, is, .]\n", "\n", "Word: from\n", "Tag: IN\n", "Head: recovered\n", "Dependency relation: prep\n", "Children: [illness]\n", "\n", "Word: his\n", "Tag: PRP$\n", "Head: illness\n", "Dependency relation: poss\n", "Children: []\n", "\n", "Word: illness\n", "Tag: NN\n", "Head: from\n", "Dependency relation: pobj\n", "Children: [his]\n", "\n", "Word: and\n", "Tag: CC\n", "Head: recovered\n", "Dependency relation: cc\n", "Children: []\n", "\n", "Word: is\n", "Tag: VBZ\n", "Head: recovered\n", "Dependency relation: conj\n", "Children: [continually, on, ,, watching]\n", "\n", "Word: continually\n", "Tag: RB\n", "Head: is\n", "Dependency relation: advmod\n", "Children: []\n", "\n", "Word: on\n", "Tag: IN\n", "Head: is\n", "Dependency relation: prep\n", "Children: [deck]\n", "\n", "Word: the\n", "Tag: DT\n", "Head: deck\n", "Dependency relation: det\n", "Children: []\n", "\n", "Word: deck\n", "Tag: NN\n", "Head: on\n", "Dependency relation: pobj\n", "Children: [the]\n", "\n", "Word: ,\n", "Tag: ,\n", "Head: is\n", "Dependency relation: punct\n", "Children: [\n", "]\n", "\n", "Word: \n", "\n", "Tag: _SP\n", "Head: ,\n", "Dependency relation: \n", "Children: []\n", "\n", "Word: apparently\n", "Tag: RB\n", "Head: watching\n", "Dependency relation: advmod\n", "Children: []\n", "\n", "Word: watching\n", "Tag: VBG\n", "Head: is\n", "Dependency relation: advcl\n", "Children: [apparently, for]\n", "\n", "Word: for\n", "Tag: IN\n", "Head: watching\n", "Dependency relation: prep\n", "Children: [sledge]\n", "\n", "Word: the\n", "Tag: DT\n", "Head: sledge\n", "Dependency relation: det\n", "Children: []\n", "\n", "Word: sledge\n", "Tag: NN\n", "Head: for\n", "Dependency relation: pobj\n", "Children: [the, preceded]\n", "\n", "Word: that\n", "Tag: WDT\n", "Head: preceded\n", "Dependency relation: nsubj\n", "Children: []\n", "\n", "Word: preceded\n", "Tag: VBD\n", "Head: sledge\n", "Dependency relation: relcl\n", "Children: [that, own]\n", "\n", "Word: his\n", "Tag: PRP$\n", "Head: own\n", "Dependency relation: poss\n", "Children: []\n", "\n", "Word: own\n", "Tag: JJ\n", "Head: preceded\n", "Dependency relation: dobj\n", "Children: [his]\n", "\n", "Word: .\n", "Tag: .\n", "Head: recovered\n", "Dependency relation: punct\n", "Children: []\n" ] } ], "source": [ "sent = random.choice(sentences)\n", "print(\"Original sentence:\", sent.text.replace(\"\\n\", \" \"))\n", "for word in sent:\n", " print()\n", " print(\"Word:\", word.text)\n", " print(\"Tag:\", word.tag_)\n", " print(\"Head:\", word.head.text)\n", " print(\"Dependency relation:\", 
word.dep_)\n", " print(\"Children:\", list(word.children))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's a list of a few dependency relations and what they mean, for quick reference:\n", "\n", "* `nsubj`: this word's head is a verb, and this word is itself the subject of the verb\n", "* `nsubjpass`: same as above, but for subjects in sentences in the passive voice\n", "* `dobj`: this word's head is a verb, and this word is itself the direct object of the verb\n", "* `iobj`: same as above, but indirect object\n", "* `aux`: this word's head is a verb, and this word is an \"auxiliary\" verb (like \"have\", \"will\", \"be\")\n", "* `attr`: this word's head is a copula (like \"to be\"), and this is the description attributed to the subject of the sentence (e.g., in \"This product is a global brand\", `brand` is dependent on `is` with the `attr` dependency relation)\n", "* `det`: this word's head is a noun, and this word is a determiner of that noun (like \"the,\" \"this,\" etc.)\n", "* `amod`: this word's head is a noun, and this word is an adjective describing that noun\n", "* `prep`: this word is a preposition that modifies its head\n", "* `pobj`: this word is a dependent (object) of a preposition" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using .subtree for extracting syntactic units\n", "\n", "That's all pretty abstract, so let's get a bit more concrete, and write some code that will let us extract syntactic units based on their dependency relation. There are a couple of things we need in order to do this. A token's `.subtree` attribute evaluates to a generator that can be flattened by passing it to `list()`. This is a list of the word's syntactic dependents—essentially, the \"clause\" that the word belongs to." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This function merges a subtree and returns a string with the text of the words contained in it:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "def flatten_subtree(st):\n", " return ''.join([w.text_with_ws for w in list(st)]).strip()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With this function in our toolbox, we can write a loop that prints out the subtree for each word in a sentence. (Again, this code is just here to demonstrate what the process of grabbing subtrees looks like—it doesn't do anything useful yet!)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original sentence: I saw an insurmountable barrier placed between me and my fellow men; this barrier was sealed with the blood of William and Justine, and to reflect on the events connected with those names filled my soul with anguish. 
\n", "\n", "Word: I\n", "Flattened subtree: I\n", "\n", "Word: saw\n", "Flattened subtree: I saw an insurmountable barrier placed between me and my fellow men\n", "\n", "Word: an\n", "Flattened subtree: an\n", "\n", "Word: insurmountable\n", "Flattened subtree: insurmountable\n", "\n", "Word: barrier\n", "Flattened subtree: an insurmountable barrier placed between me and my fellow men\n", "\n", "Word: placed\n", "Flattened subtree: placed between me and my fellow men\n", "\n", "Word: between\n", "Flattened subtree: between me and my fellow men\n", "\n", "Word: me\n", "Flattened subtree: me and my fellow men\n", "\n", "Word: and\n", "Flattened subtree: and\n", "\n", "Word: my\n", "Flattened subtree: my\n", "\n", "Word: \n", "Flattened subtree: \n", "\n", "Word: fellow\n", "Flattened subtree: fellow\n", "\n", "Word: men\n", "Flattened subtree: my fellow men\n", "\n", "Word: ;\n", "Flattened subtree: ;\n", "\n", "Word: this\n", "Flattened subtree: this\n", "\n", "Word: barrier\n", "Flattened subtree: this barrier\n", "\n", "Word: was\n", "Flattened subtree: was\n", "\n", "Word: sealed\n", "Flattened subtree: I saw an insurmountable barrier placed between me and my fellow men; this barrier was sealed with the blood of William and Justine, and to reflect on the events connected with those names filled my soul with anguish.\n", "\n", "Word: with\n", "Flattened subtree: with the blood of William and Justine\n", "\n", "Word: the\n", "Flattened subtree: the\n", "\n", "Word: blood\n", "Flattened subtree: the blood of William and Justine\n", "\n", "Word: of\n", "Flattened subtree: of William and Justine\n", "\n", "Word: William\n", "Flattened subtree: William and Justine\n", "\n", "Word: and\n", "Flattened subtree: and\n", "\n", "Word: \n", "Flattened subtree: \n", "\n", "Word: Justine\n", "Flattened subtree: Justine\n", "\n", "Word: ,\n", "Flattened subtree: ,\n", "\n", "Word: and\n", "Flattened subtree: and\n", "\n", "Word: to\n", "Flattened subtree: to\n", "\n", "Word: reflect\n", "Flattened subtree: to reflect on the events connected with those names filled my soul with anguish\n", "\n", "Word: on\n", "Flattened subtree: on the events connected with those names filled my soul with anguish\n", "\n", "Word: the\n", "Flattened subtree: the\n", "\n", "Word: events\n", "Flattened subtree: the events connected with those names filled my soul with anguish\n", "\n", "Word: connected\n", "Flattened subtree: connected with those names filled my soul with anguish\n", "\n", "Word: with\n", "Flattened subtree: with those names filled my soul with anguish\n", "\n", "Word: those\n", "Flattened subtree: those\n", "\n", "Word: names\n", "Flattened subtree: those names filled my soul with anguish\n", "\n", "Word: filled\n", "Flattened subtree: filled my soul with anguish\n", "\n", "Word: \n", "Flattened subtree: \n", "\n", "Word: my\n", "Flattened subtree: my\n", "\n", "Word: soul\n", "Flattened subtree: my soul\n", "\n", "Word: with\n", "Flattened subtree: with anguish\n", "\n", "Word: anguish\n", "Flattened subtree: anguish\n", "\n", "Word: .\n", "Flattened subtree: .\n", "\n", "Word: \n", "Flattened subtree: \n" ] } ], "source": [ "sent = random.choice(sentences)\n", "print(\"Original sentence:\", sent.text.replace(\"\\n\", \" \"))\n", "for word in sent:\n", " print()\n", " print(\"Word:\", word.text.replace(\"\\n\", \" \"))\n", " print(\"Flattened subtree: \", flatten_subtree(word.subtree).replace(\"\\n\", \" \"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the subtree and our knowledge 
of dependency relation types, we can write code that extracts larger syntactic units based on their relationship with the rest of the sentence. For example, to get all of the noun phrases that are subjects of a verb:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "subjects = []\n", "for word in doc:\n", " if word.dep_ in ('nsubj', 'nsubjpass'):\n", " subjects.append(flatten_subtree(word.subtree))" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Justine',\n", " 'that',\n", " 'which',\n", " 'The hour of my irresolution',\n", " 'Beaufort',\n", " 'the wind that blew me from the detested\\nshore of Ireland, and the sea which surrounded me',\n", " 'Immense and\\nrugged mountains of ice',\n", " 'I',\n", " 'that',\n", " 'I',\n", " 'I',\n", " 'I']" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random.sample(subjects, 12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or every prepositional phrase:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "prep_phrases = []\n", "for word in doc:\n", " if word.dep_ == 'prep':\n", " prep_phrases.append(flatten_subtree(word.subtree).replace(\"\\n\", \" \"))" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['of the copyright holder',\n", " 'of preservation',\n", " 'in torrents',\n", " 'in England',\n", " 'from my sight',\n", " 'at a short distance from the shore',\n", " 'of',\n", " 'as an occurrence which no accident could possibly prevent',\n", " 'by one',\n", " 'of the beauty',\n", " 'with the tenderest compassion',\n", " 'of revenge and hatred']" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random.sample(prep_phrases, 12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating text from extracted units\n", "\n", "One thing I like to do is put together text from parts we've disarticulated with spaCy. Let's use Tracery to do this. If you don't know how to use Tracery, feel free to consult [my Tracery tutorial](tracery-and-python.ipynb) before continuing.\n", "\n", "So I want to generate sentences based on things that I've extracted from my text. My first idea: get subjects of sentences, verbs of sentences, nouns and adjectives, and prepositional phrases:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "subjects = [flatten_subtree(word.subtree).replace(\"\\n\", \" \")\n", " for word in doc if word.dep_ in ('nsubj', 'nsubjpass')]\n", "past_tense_verbs = [word.text for word in words if word.tag_ == 'VBD' and word.lemma_ != 'be']\n", "adjectives = [word.text for word in words if word.tag_.startswith('JJ')]\n", "nouns = [word.text for word in words if word.tag_.startswith('NN')]\n", "prep_phrases = [flatten_subtree(word.subtree).replace(\"\\n\", \" \")\n", " for word in doc if word.dep_ == 'prep']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notes on the code above:\n", "\n", "* The `.replace(\"\\n\", \" \")` is in there because spaCy treats linebreaks as normal whitespace, and retains them when we ask for the span's text. 
For formatting reasons, we want to get rid of this.\n", "* I'm using `.startswith()` in the checks for parts of speech in order to capture other related parts of speech (e.g., `JJR` is comparative adjectives, `NNS` is plural nouns).\n", "* I use only past tense verbs so we don't have to worry about subject/verb agreement in English. I'm excluding forms of *to be* because it is the only verb that agrees with its subject in the past tense.\n", "\n", "Now I'll import Tracery. If you haven't already installed it, you can do so using the following cell:" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: tracery in /Users/allison/opt/miniconda3/envs/rwet-2022/lib/python3.8/site-packages (0.1.1)\r\n" ] } ], "source": [ "import sys\n", "!{sys.executable} -m pip install tracery" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "import tracery\n", "from tracery.modifiers import base_english" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "... and define a grammar. The \"trick\" of this example is that I grab entire rule expansions from the units extracted from the text using spaCy. The grammar itself is built around producing sentences that look and feel like English." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Of my creature, he knew an immutable victim.'" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rules = {\n", " \"origin\": [\n", " \"#subject.capitalize# #predicate#.\",\n", " \"#subject.capitalize# #predicate#.\",\n", " \"#prepphrase.capitalize#, #subject# #predicate#.\"\n", " ],\n", " \"predicate\": [\n", " \"#verb#\",\n", " \"#verb# #nounphrase#\",\n", " \"#verb# #prepphrase#\"\n", " ],\n", " \"nounphrase\": [\n", " \"the #noun#\",\n", " \"the #adj# #noun#\",\n", " \"the #noun# #prepphrase#\",\n", " \"the #noun# and the #noun#\",\n", " \"#noun.a#\",\n", " \"#adj.a# #noun#\",\n", " \"the #noun# that #predicate#\"\n", " ],\n", " \"subject\": subjects,\n", " \"verb\": past_tense_verbs,\n", " \"noun\": nouns,\n", " \"adj\": adjectives,\n", " \"prepphrase\": prep_phrases\n", "}\n", "grammar = tracery.Grammar(rules)\n", "grammar.add_modifiers(base_english)\n", "grammar.flatten(\"#origin#\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's generate a whole paragraph of this and format it nicely:" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The very accents of love seemed. Of some uncontrollable\n", "passion, his hope and his dream dreaded upon. I had. For\n", "many years, I had without further opportunities to fix the\n", "problem. The relation of my disasters assumed the vengeance\n", "that had in you. This history made the contrast and the\n", "hope. We placed. For nearly any purpose such as creation of\n", "derivative works, reports, performances and research, I said\n", "the blackest Begone. I did of this tour about the end of\n", "July. I disinclined the September into his heart. The moon\n", "gained the spring. 
His countenance looked in other respects.\n" ] } ], "source": [ "from textwrap import fill\n", "output = \" \".join([grammar.flatten(\"#origin#\") for i in range(12)])\n", "print(fill(output, 60))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I like this approach for a number of reasons. Because I'm using a hand-written grammar, I have a great deal of control over the shape and rhythm of the sentences that are generated. But spaCy lets me pre-populate my grammar's vocabulary without having to write each item by hand." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Further reading and resources\n", "\n", "We've barely scratched the surface of what it's possible to do with spaCy. [The official site has a good list of guides to various natural language processing tasks](https://spacy.io/usage) that you should check out, and there are also [a handful of books that dig deeper into using spaCy for natural language processing tasks](https://spacy.io/universe/category/books)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 2 }