{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chapter 15: Off to analyzing text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Way to go! You have already learned a lot of essential components of the Python language. Being able to deal with data structures, import packages, build your own functions and operate with files is not only essential for most tasks in Python, but also a prerequisite for text analysis. We have applied some common preprocessing steps like casefolding/lowercasing, punctuation removal, and stemming/lemmatization. Did you know that there are some very useful NLP packages and modules that do some of these steps? One that is often used in text analysis is the Python package **NLTK (the Natural Language Toolkit)**.\n",
    "\n",
    "### At the end of this chapter, you will be able to:\n",
    "* have an idea of the NLP tasks that constitute an NLP pipeline\n",
    "* use the functions of the NLTK module to manipulate the content of files for NLP purposes (e.g. sentence splitting, tokenization, POS-tagging, and lemmatization);\n",
    "* do nesting of multiple for-loops or files\n",
    "\n",
    "### More NLP software for Python:\n",
    "* [NLTK](http://www.nltk.org/)\n",
    "* [SpaCy](https://spacy.io/)\n",
    "* [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/index.html)\n",
    "* [About Python NLP libraries](https://elitedatascience.com/python-nlp-libraries)\n",
    "\n",
    "\n",
    "If you have **questions** about this chapter, please contact us **(cltl.python.course@gmail.com)**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1 A short intro to text processing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are many aspects of text we can (try to) analyze. Commonly used analyses conducted in Natural Language Processing (**NLP**) are for instance:\n",
    "\n",
    "* determining the part of speech of words in a text (verb, noun, etc.)\n",
    "* analyzing the syntactic relations between words and phrases in a sentence (i.e., syntactic parsing)\n",
    "* analyzing which entities (people, organizations, locations) are mentioned in a text\n",
    "\n",
    "...and many more. Each of these aspects is addressed within its own **NLP task**. \n",
    "\n",
    "**The NLP pipeline**\n",
    "\n",
    "Usually, these tasks are carried out sequentially because they depend on each other. For instance, we need to first tokenize the text (split it into words) in order to be able to assign part-of-speech tags to each word. This sequence is often called an **NLP pipeline**. For example, a general pipeline could consist of the components shown below (taken from [here](https://www.slideshare.net/YuriyGuts/natural-language-processing-nlp)) You can see the NLP pipeline of the NewsReader project [here](http://www.newsreader-project.eu/files/2014/02/SystemArchitecture.png). (you can ignore the middle part of the picture, and focus on the blue and green boxes in the outer row).\n",
    "\n",
    "<img src='images/nlp-pipeline.jpg'>\n",
    "\n",
    "In this chapter we will look into four simple NLP modules that are nevertheless very common in NLP: **tokenization, sentence splitting**, **lemmatization** and **POS tagging**. \n",
    "\n",
    "There are also more advanced processing modules out there - feel free to do some research yourself :-) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2 The NLTK package"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "NLTK (Natural Language Processing Toolkit) is a module we can use for most fundamental aspects of natural language processing. There are many more advanced approaches out there, but it is a good way of getting started. \n",
    "\n",
    "Here we will show you how to use it for tokenization, sentence splitting, POS tagging, and lemmatization. These steps are necessary processing steps for most NLP tasks. \n",
    "\n",
    "We will first give you an overview of all tasks and then delve into each of them in more detail. \n",
    "\n",
    "Before we can use NLTK for the first time, we have to make sure it is downloaded and installed on our computer (some of you may have already done this). \n",
    "\n",
    "To install NLTK, please try to run the following two cells. If this does not work, please try and follow the [documentation](http://www.nltk.org/install.html). If you don't manage to get this to work, please ask for help. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "pip install nltk"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once you have downloaded the NLTK book, you do not need to run the download again. If you are using the NLTK again, it is sufficient to import it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# downloading nltk\n",
    "\n",
    "import nltk\n",
    "nltk.download('book')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have installed and downloaded NLTK, let's look at an example of a simple NLP pipeline. In the following cell, you can observe how we tokenize raw text into tokens and setnences, perform part of speech tagging and lemmatize some of the tokens. Don't worry about the details just yet - we will go trhough them step by step. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = \"This example sentence is used for illustrating some basic NLP tasks. Language is awesome!\"\n",
    "\n",
    "# Tokenization\n",
    "tokens = nltk.word_tokenize(text)\n",
    "\n",
    "# Sentence splitting\n",
    "sentences = nltk.sent_tokenize(text)\n",
    "\n",
    "# POS tagging\n",
    "tagged_tokens = nltk.pos_tag(tokens)\n",
    "\n",
    "# Lemmatization\n",
    "lmtzr = nltk.stem.wordnet.WordNetLemmatizer()\n",
    "lemma=lmtzr.lemmatize(tokens[4], 'v')\n",
    "\n",
    "# Printing all information\n",
    "print(tokens)\n",
    "print(sentences)\n",
    "print(tagged_tokens)\n",
    "print(lemma)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.1 Tokenization and sentence splitting with NLTK"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1.1 `word_tokenize()`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's try tokenizing our Charlie story! First, we will open and read the file again and assign the file contents to the variable `content`. Then, we can call the `word_tokenize()` function from the `nltk` module as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "with open(\"../Data/Charlie/charlie.txt\") as infile:\n",
    "    content = infile.read()\n",
    "\n",
    "tokens = nltk.word_tokenize(content)\n",
    "print(type(tokens), len(tokens))\n",
    "print(tokens)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, we now have a list of all words in the text. The punctuation marks are also in the list, but as separate tokens."
   ]
  },
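  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see this behaviour in isolation, here is a minimal sketch on a short, made-up example sentence (not part of the Charlie story): punctuation marks become separate tokens, and contractions such as *isn't* are split as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nltk\n",
    "\n",
    "# A made-up example sentence to illustrate how word_tokenize() behaves\n",
    "example = \"Hello, world! Isn't tokenization great?\"\n",
    "example_tokens = nltk.word_tokenize(example)\n",
    "print(example_tokens)\n",
    "# Expected output (roughly): ['Hello', ',', 'world', '!', 'Is', \"n't\", 'tokenization', 'great', '?']"
   ]
  },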
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1.2 `sent_tokenize()`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another thing that NLTK can do for you is to split a text into sentences by using the `sent_tokenize()` function. We use it on the entire text (as a string):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(\"../Data/Charlie/charlie.txt\") as infile:\n",
    "    content = infile.read()\n",
    "\n",
    "sentences = nltk.sent_tokenize(content)\n",
    "\n",
    "print(type(sentences), len(sentences))\n",
    "print(sentences)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now do all sorts of cool things with these lists. For example, we can search for all words that have certain letters in them and add them to a list. Let's say we want to find all present participles in the text. We know that present participles end with *-ing*, so we can do something like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Open and read in file as a string, assign it to the variable `content`\n",
    "with open(\"../Data/Charlie/charlie.txt\") as infile:\n",
    "    content = infile.read()\n",
    "    \n",
    "# Split up entire text into tokens using word_tokenize():\n",
    "tokens = nltk.word_tokenize(content)\n",
    "\n",
    "# create an empty list to collect all words having the present participle -ing:\n",
    "present_participles = []\n",
    "\n",
    "# looking through all tokens\n",
    "for token in tokens:\n",
    "    # checking if a token ends with the present parciciple -ing\n",
    "    if token.endswith(\"ing\"):\n",
    "        # if the condition is met, add it to the list we created above (present_participles)\n",
    "        present_participles.append(token)\n",
    "        \n",
    "# Print the list to inspect it\n",
    "print(present_participles)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This looks good! We now have a list of words like *boiling*, *sizzling*, etc. However, we can see that there is one word in the list that actually is not a present participle (*ceiling*). Of course, also other words can end with *-ing*. So if we want to find all present participles, we have to come up with a smarter solution."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.2. Part-of-speech (POS) tagging"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once again, NLTK comes to the rescue. Using the function `pos_tag()`, we can label each word in the text with its part of speech. \n",
    "\n",
    "To do pos-tagging, you first need to tokenize the text. We have already done this above, but we will repeat the steps here, so you get a sense of what an NLP pipeline may look like."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2.1 `pos_tag()`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see how `pos_tag()` can be used, we can (as always) look at the documentation by using the `help()` function. As we can see, `pos_tag()` takes a tokenized text as input and returns a list of tuples in which the first element corresponds to the token and the second to the assigned pos-tag."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# As always, we can start by reading the documentation:\n",
    "help(nltk.pos_tag)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Open and read in file as a string, assign it to the variable `content`\n",
    "with open(\"../Data/Charlie/charlie.txt\") as infile:\n",
    "    content = infile.read()\n",
    "    \n",
    "# Split up entire text into tokens using word_tokenize():\n",
    "tokens = nltk.word_tokenize(content)\n",
    "\n",
    "# Apply pos tagging to the tokenized text\n",
    "tagged_tokens = nltk.pos_tag(tokens)\n",
    "\n",
    "# Inspect pos tags\n",
    "print(tagged_tokens)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2.2 Working with POS tags"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we saw above, `pos_tag()` returns a list of tuples: The first element is the token, the second element indicates the part of speech (POS) of the token. \n",
    "\n",
    "This POS tagger uses the POS tag set of the Penn Treebank Project, which can be found [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). For example, all tags starting with a V are used for verbs. \n",
    "\n",
    "We can now use this, for example, to identify all the verbs in a text:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Open and read in file as a string, assign it to the variable `content`\n",
    "with open(\"../Data/Charlie/charlie.txt\") as infile:\n",
    "    content = infile.read()\n",
    "    \n",
    "# Apply tokenization and POS tagging\n",
    "tokens = nltk.word_tokenize(content)\n",
    "tagged_tokens = nltk.pos_tag(tokens)\n",
    "\n",
    "# List of verb tags (i.e. tags we are interested in)\n",
    "verb_tags = [\"VBD\", \"VBG\", \"VBN\", \"VBP\", \"VBZ\"]\n",
    "\n",
    "# Create an empty list to collect all verbs:\n",
    "verbs = []\n",
    "\n",
    "# Iterating over all tagged tokens\n",
    "for token, tag in tagged_tokens:\n",
    " \n",
    "    # Checking if the tag is any of the verb tags\n",
    "    if tag in verb_tags:\n",
    "        # if the condition is met, add it to the list we created above \n",
    "        verbs.append(token)\n",
    "        \n",
    "# Print the list to inspect it\n",
    "print(verbs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.3. Lemmatization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also use NLTK to lemmatize words.\n",
    "\n",
    "The lemma of a word is the form of the word which is usually used in dictionary entries. This is useful for many NLP tasks, as it gives a better generalization than the strong a word appears in. To a computer, `cat` and `cats` are two completely different tokens, even though we know they are both forms of the same lemma. \n",
    "\n"
   ]
  },
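  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of this idea (using NLTK's WordNet lemmatizer, which is properly introduced in the next subsection), both *cats* and *cat* are mapped to the same lemma *cat*:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A minimal sketch: 'cats' and 'cat' are mapped to the same lemma\n",
    "from nltk.stem.wordnet import WordNetLemmatizer\n",
    "\n",
    "lemmatizer = WordNetLemmatizer()\n",
    "print(lemmatizer.lemmatize(\"cats\", \"n\"))  # -> cat\n",
    "print(lemmatizer.lemmatize(\"cat\", \"n\"))   # -> cat"
   ]
  },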
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3.1 The WordNet lemmatizer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use the WordNetLemmatizer for this using the `lemmatize()` function. In the code below, we loop through the list of verbs, lemmatize each of the verbs, and add them to a new list called `verb_lemmas`. Again, we show all the processing steps (consider the comments in the code below):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#################################################################################\n",
    "#### Process text as explained above ###\n",
    "\n",
    "with open(\"../Data/Charlie/charlie.txt\") as infile:\n",
    "    content = infile.read()\n",
    "    \n",
    "tokens = nltk.word_tokenize(content)\n",
    "tagged_tokens = nltk.pos_tag(tokens)\n",
    "\n",
    "verb_tags = [\"VBD\", \"VBG\", \"VBN\", \"VBP\", \"VBZ\"]\n",
    "verbs = []\n",
    "\n",
    "for token, tag in tagged_tokens:\n",
    "    if tag in verb_tags:\n",
    "        verbs.append(token)\n",
    "\n",
    "print(verbs)\n",
    "\n",
    "#############################################################################\n",
    "#### Use the list of verbs collected above to lemmatize all the verbs ###\n",
    "\n",
    "        \n",
    "# Instatiate a lemmatizer object\n",
    "lmtzr = nltk.stem.wordnet.WordNetLemmatizer()\n",
    "\n",
    "# Create list to collect all the verb lemmas:\n",
    "verb_lemmas = []\n",
    "        \n",
    "for participle in verbs:\n",
    "    # For this lemmatizer, we need to indicate the POS of the word (in this case, v = verb)\n",
    "    lemma = lmtzr.lemmatize(participle, \"v\") \n",
    "    verb_lemmas.append(lemma)\n",
    "print(verb_lemmas)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Note about the wordnet lemmatizer:** \n",
    "\n",
    "We need to specify a POS tag to the WordNet lemmatizer, in a WordNet format (\"n\" for noun, \"v\" for verb, \"a\" for adjective). If we do not indicate the Part-of-Speech tag, the WordNet lemmatizer thinks it is a noun (this is the default value for its part-of-speech). See the examples below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_nouns = ('building', 'applications', 'leafs')\n",
    "for n in test_nouns:\n",
    "    print(f\"Noun in conjugated form: {n}\")\n",
    "    default_lemma=lmtzr.lemmatize(n) # default lemmatization, without specifying POS, n is interpretted as a noun!\n",
    "    print(f\"Default lemmatization: {default_lemma}\")\n",
    "    verb_lemma=lmtzr.lemmatize(n, 'v')\n",
    "    print(f\"Lemmatization as a verb: {verb_lemma}\")\n",
    "    noun_lemma=lmtzr.lemmatize(n, 'n')\n",
    "    print(f\"Lemmatization as a noun: {noun_lemma}\")\n",
    "    print()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_verbs=('does', 'standing', 'plays')\n",
    "for v in test_verbs:\n",
    "    print(f\"Verb in conjugated form: {v}\")\n",
    "    default_lemma=lmtzr.lemmatize(v) # default lemmatization, without specifying POS, v is interpretted as a noun!\n",
    "    print(f\"Default lemmatization: {default_lemma}\")\n",
    "    verb_lemma=lmtzr.lemmatize(v, 'v')\n",
    "    print(f\"Lemmatization as a verb: {verb_lemma}\")\n",
    "    noun_lemma=lmtzr.lemmatize(v, 'n')\n",
    "    print(f\"Lemmatization as a noun: {noun_lemma}\")\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3 Nesting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So far, we typically used a single for-loop, or we were opening a single file at a time. In Python (and most programming languages), one can **nest** multiple loops or files in one another. For instance, we can use one (outer) for-loop to iterate through files, and then for each file iterate through all its sentences (internal for-loop). As we have learned above, `glob` is a convenient way of creating a list of files. \n",
    "\n",
    "You might think: can we stretch this on more levels? Iterate through files, then iterate through the sentences in these files, then iterate through each word in these sentences, then iterate through each letter in these words, etc.  This is possible. Python (and most programming languages) allow you to perform nesting with (in theory) as many loops as you want. Keep in mind that nesting too much will eventually cause computational problems, but this also depends on the size of your data. \n",
    "\n",
    "For the tasks we are treating here, a a couple of levels of nesting are fine. \n",
    "\n",
    "In the code below, we want get an idea of the number and length of the sentences in the texts stored in the `../Data/dreams` directory. We do this by creating two for loops: We iterate over all the files in the directory (loop 1), apply sentence tokenization and iterate over all the sentences in the file (loop 2).\n",
    "\n",
    "Look at the code and comments below to figure out what is going on:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import glob\n",
    "\n",
    "### Loop 1 ####\n",
    "# Loop1: iterate over all the files in the dreams directory\n",
    "for filename in glob.glob(\"../Data/dreams/*.txt\"): \n",
    "    # read in the file and assign the content to a variable\n",
    "    with open(filename, \"r\") as infile:\n",
    "        content = infile.read()\n",
    "    # split the content into sentences\n",
    "    sentences = nltk.sent_tokenize(content) \n",
    "    # Print the number of sentences in the file\n",
    "    print(f\"INFO: File {filename} has {len(sentences)} sentences\")     \n",
    "\n",
    "    # For each file, assign a number to each sentence. Start with 0:\n",
    "    counter=0\n",
    "\n",
    "    #### Loop 2 ####\n",
    "    # Loop 2: loop over all the sentences in a file:\n",
    "    for sentence in sentences:\n",
    "         # add 1 to the counter\n",
    "        counter+=1   \n",
    "         # tokenize the sentence\n",
    "        tokens=nltk.word_tokenize(sentence) \n",
    "        # print the number of tokens per sentence\n",
    "        print(f\"Sentence {counter} has {len(tokens)} tokens\")   \n",
    "               \n",
    "    # print an empty line after each file (this belongs to loop 1)\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4 Putting it all together"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this section, we will use what we have learned above to write a small NLP program. We will go through all the steps and show how they can be put together. In the last chapters, we have already learned how to write functions. We will make use of this skill here. \n",
    "\n",
    "Our goal is to collect all the nouns from Vickie's dream reports. \n",
    "\n",
    "Before we write actual code, it is always good to consider which steps we need to carry out to reach the goal. \n",
    "\n",
    "Important steps to remember:\n",
    "\n",
    "* create a list of all the files we want to process\n",
    "* open and read the files\n",
    "* tokenize the texts\n",
    "* perform pos-tagging\n",
    "* collect all the tokens analyzed as nouns\n",
    "\n",
    "Remember, we first needed to import `nltk` to use it. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.1 Writing a processing function for a single file\n",
    "\n",
    "Since we want to carry out the same task for each of the files, it is very useful (and good practice!) to write a single function which can do the processing. The following function reads the specified file and returns the tokens with their POS tags:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nltk\n",
    "\n",
    "def tag_tokens_file(filepath):\n",
    "    \"\"\"Read the contents of the file found at the location specified in \n",
    "    FILEPATH and return a list of its tokens with their POS tags.\"\"\"\n",
    "    with open(filepath, \"r\") as infile:\n",
    "        content = infile.read()\n",
    "        tokens = nltk.word_tokenize(content)\n",
    "        tagged_tokens = nltk.pos_tag(tokens)\n",
    "    return tagged_tokens"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, instead of having to open a file, read the contents and close the file, we can just call the function `tag_tokens_file` to do this. We can test it on a single file: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "filename = \"../Data/dreams/vickie1.txt\"\n",
    "tagged_tokens = tag_tokens_file(filename)\n",
    "print(tagged_tokens)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.2 Iterating over all the files and applying the processing function"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also do this for each of the files in the `../Data/dreams` directory by using a for-loop:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import glob\n",
    "\n",
    "# Iterate over the `.txt` files in the directory and perform POS tagging on each of them\n",
    "for filename in glob.glob(\"../Data/dreams/*.txt\"): \n",
    "    tagged_tokens = tag_tokens_file(filename)\n",
    "    print(filename, \"\\n\", tagged_tokens, \"\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.3 Collecting all the nouns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we extend this code a bit so that we don't print all POS-tagged tokens of each file, but we get all (proper) nouns from the texts and add them to a list called `nouns_in_dreams`. Then, we print the set of nouns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a list that will contain all nouns\n",
    "nouns_in_dreams = []\n",
    "\n",
    "# Iterate over the `.txt` files in the directory and perform POS tagging on each of them\n",
    "for filename in glob.glob(\"../Data/dreams/*.txt\"): \n",
    "    tagged_tokens = tag_tokens_file(filename)\n",
    "        \n",
    "    # Get all (proper) nouns in the text (\"NN\" and \"NNP\") and add them to the list\n",
    "    for token, pos in tagged_tokens:\n",
    "        if pos in [\"NN\", \"NNP\"]:\n",
    "            nouns_in_dreams.append(token)\n",
    "\n",
    "# Print the set of nouns in all dreams\n",
    "print(set(nouns_in_dreams))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have an idea what Vickie dreams about!\n"
   ]
  },
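  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a possible follow-up (a small sketch, not required for the exercises below), we could also count how often each noun occurs, for instance with `collections.Counter` from the standard library. This assumes the cells above have been run, so that the list `nouns_in_dreams` exists:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A small sketch: count how often each noun occurs in the dream reports\n",
    "from collections import Counter\n",
    "\n",
    "# `nouns_in_dreams` was created in the cell above\n",
    "noun_counts = Counter(nouns_in_dreams)\n",
    "\n",
    "# Print the ten most frequent nouns\n",
    "print(noun_counts.most_common(10))"
   ]
  },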
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercises"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise 1:** \n",
    "\n",
    "Try to collect all the present participles in the the text store in `../Data/Charlie/charlie.txt` using the NLTK tokenizer and POS-tagger. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# you code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should get the following list: \n",
    "`['boiling', 'bubbling', 'hissing', 'sizzling', 'clanking', 'running', 'hopping', 'knowing', 'rubbing', 'cackling', 'going']`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# we can test our code using the assert statement (don't worry about this now, \n",
    "# but if you want to use it, you can probably figure out how it works yourself :-) \n",
    "# If our code is correct, we should get a compliment :-)\n",
    "assert len(present_participles) == 11 and type(present_participles[0]) == str\n",
    "print(\"Well done!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise 2:** \n",
    "\n",
    "The resulting list `verb_lemmas` above contains a lot of duplicates. Do you remember how you can get rid of these duplicates? Create a set in which each verb occurs only once and name it `unique_verbs`. Then print it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## the list is stored under the variable 'verb_lemmas'\n",
    "\n",
    "# your code here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Test your code here! If your code is correct, you should get a compliment :-)\n",
    "assert len(unique_verbs) == 28    \n",
    "print(\"Well done!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise 3:** \n",
    "\n",
    "Now use a for-loop to count the number of times that each of these verb lemmas occurs in the text! For each verb in the list you just created, get the count of this verb in `charlie.txt` using the `count()` method. Create a dictionary that contains the lemmas of the verbs as keys, and the counts of these verbs as values. Refer to the notebook about Topic 1 if you forgot how to use the `count()` method or how to create dictionary entries!\n",
    "\n",
    "Tip: you don't need to read in the file again, you can just use the list called verb_lemmas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "verb_counts = {}\n",
    "\n",
    "# Finish this for-loop\n",
    "for verb in unique_verbs:\n",
    "    # your code here\n",
    "    \n",
    "print(verb_counts) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Test your code here! If your code is correct, you should get a compliment :-)\n",
    "assert len(verb_counts) == 28 and verb_counts[\"bubble\"] == 1 and verb_counts[\"be\"] == 9\n",
    "print(\"Well done!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise 4:**\n",
    "    \n",
    "Write your counts to a file called `charlie_verb_counts.txt` and write it to `../Data/Charlie/charlie_verb_counts.txt` in the following format:\n",
    "\n",
    "verb, count\n",
    "\n",
    "verb, count \n",
    "\n",
    "...\n",
    "\n",
    "Don't forget to use newline characters at the end of each line. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}