{ "cells": [ { "metadata": {}, "cell_type": "markdown", "source": "# Assignment 3b: Writing your own nlp program" }, { "metadata": {}, "cell_type": "markdown", "source": [ "\n", "**IMPORTANT NOTE**:\n", "* The students who follow the Bachelor version of this course, i.e., the course Introduction to Python for Humanities and Social Sciences (L_AABAALG075) as part of the minor Digital Humanities, do **NOT have to do Exercises 3 and 4 of Assignment 3b** (Feel free to try it as extra practice though.)\n", "* The other students, who follow the Master version of Programming in Python for Text Analysis (L_AAMPLIN021), should **DO Exercises 3 and 4 of Assignment 3b**\n", "\n", "In this part of the assignment, we will carry out our own little text analysis project. The goal is to gain some insights into longer texts without having to read them all in detail.\n", "\n", "This part of the assignment builds on some notions that have been revised in part A of the assignment. Please feel free to go back to part A and reuse your code whenever possible.\n", "\n", "The goals of this part are:\n", "\n", "* divide a problem into smaller sub-problems and test code using small examples\n", "* doing text analysis and writing results to a file\n", "* combining small functions into bigger functions\n", "\n", "**Tip**: The assignment is split into four steps, which are divided into smaller steps. Instead of doing everything step by step, we highly recommend you read all sub-steps of a step first and then start coding. In many cases, the sub-steps are there to help you split the problem into manageable sub-problems, but it is still good to keep the overall goal in mind." ] }, { "metadata": {}, "cell_type": "markdown", "source": [ "Use the following cell to download three different books from the Project Gutenberg website. In theory, you can reuse most of this code to download any .txt files from different websites if you knnow their specific URL (web address).\n", "\n", "We defined a function called `download_book` which downloads a book in .txt format. Then, we define a dictionary with names and urls. We loop through the dictionary, download each book and write it to a file stored in the directory `books` in the current working directory. You don't need to do anything - just run the cell and the files will be downloaded to your computer." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Downloading data - you get this for free :-) \n", "\n", "import requests\n", "import os\n", "\n", "\n", "def download_book(url):\n", " \"\"\"\n", " Download book given a url to a book in .txt format and return it as a string.\n", " \"\"\"\n", " text_request = requests.get(url)\n", " text = text_request.text\n", " return text\n", "\n", "\n", "book_urls = dict()\n", "book_urls['HuckFinn'] = 'http://www.gutenberg.org/cache/epub/7100/pg7100.txt'\n", "book_urls['Macbeth'] = 'http://www.gutenberg.org/cache/epub/1533/pg1533.txt'\n", "book_urls['AnnaKarenina'] = 'http://www.gutenberg.org/files/1399/1399-0.txt'\n", "\n", "\n", "if not os.path.isdir('../Data/books/'):\n", " os.mkdir('../Data/books/')\n", " \n", " \n", "for name, url in book_urls.items():\n", " text = download_book(url)\n", " with open('../Data/books/'+name+'.txt', 'w', encoding='utf-8') as outfile:\n", " outfile.write(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Encoding issues with txt files**\n", "\n", "For Windows users, the file “AnnaKarenina.txt” gets the encoding cp1252. \n", "In order to open the file, you have to add **encoding='utf-8'**, i.e.,\n", "\n", "```python\n", "a_path = 'some path on your computer.txt'\n", "with open(a_path, mode='r', encoding='utf-8'):\n", " # process file\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 1\n", "Was the download successful? Let's start writing code! Please create the following two Python modules:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* **analyze.py** will be your main module, which you will call from the command line\n", "* **utils.py** will be a module to contain helper functions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More precisely, please create two files, **analyze.py** and **utils.py**, which are both placed in the same directory as this notebook. The two files are empty at this stage of the assignment." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1.a) Write a function called `get_paths` and store it in the Python module **utils.py**\n", "\n", "The function `get_paths`:\n", "* takes one positional parameter called *input_folder*\n", "* the function stores all paths to .txt files in the *input_folder* in a list [it does **not** need to search recursively inside this folder]\n", "* the function returns a list of strings, i.e., each string is a file path\n", "\n", "Once you've created the function and stored it in **utils.py**:\n", "* Import the function into **analyze.py**, using `from utils import get_paths`\n", "* Call the function inside **analyze.py** (input_folder=\"../Data/books\")\n", "* Assign the output of the function to a variable and print this variable.\n", "* call **analyze.py** from the command line to test it" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 2\n", "2.a) Let's get a little bit of an overview of what we can find in each text. Write a function called `get_basic_stats`.\n", "\n", "The function `get_basic_stats`:\n", "* has one positional parameter called **txt_path** which is the path to a txt file\n", "* reads the content of the txt file into a string\n", "* Computes the following statistics:\n", " * The number of sentences\n", " * The number of tokens\n", " * The size of the vocabulary used (i.e. unique tokens)\n", " * the number of chapters/acts:\n", " * count occurrences of 'CHAPTER' in **HuckFinn.txt**\n", " * count occurrences of 'Chapter ' (with the space) in **AnnaKarenina.txt**\n", " * count occurrences of 'ACT' in **Macbeth.txt**\n", "* return a dictionary with four key:value pairs, one for each statistic described above:\n", " * num_sents\n", " * num_tokens\n", " * vocab_size \n", " * num_chapters_or_acts\n", "\n", "In order to compute the statistics, you need to perform sentence splitting and tokenization. Here is an example snippet." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nltk.tokenize import sent_tokenize, word_tokenize\n", "\n", "text = 'Python is a programming language. It was created by Guido van Rossum.'\n", "for sent in sent_tokenize(text):\n", " print('SENTENCE', sent)\n", " tokens = word_tokenize(sent)\n", " print('TOKENS', tokens)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2.b) Store the function in the Python module **utils.py**. Import it in **analyze.py**. \n", "Edit **analyze.py** so that:\n", "* you first call the function **get_paths**\n", "* create an empty dictionary called **book2stats**, i.e., `book2stats = {}`\n", "* Loop over the list of txt files (the output from **get_paths**) and call the function **get_basic_stats** on each file\n", "* print the output of calling the function **get_basic_stats** on each file.\n", "* update the dictionary **book2stats** with each iteration of the for loop." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tip: **book2stats** is a dictionary mapping a book name (the key), e.g., ‘AnnaKarenina’, to a dictionary (the value) (the output from get_basic_stats)\n", "\n", "Tip: please use the following code snippet to obtain the basename name of a file path:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os \n", "basename = os.path.basename('../Data/books/HuckFinn.txt')\n", "book = basename.split('.')[0]\n", "print(book)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Please note that Exercises 3 and 4, respectively, are designed to be difficult. You will have to combine what you have learnt so far to complete these exercises. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 3 (only required for MA)\n", "\n", "\n", "Let's compare the books based on the statistics we calculated. Create a dictionary `stats2book_with_highest_value` in **analyze.py** with four keys:\n", "* num_sents\n", "* num_tokens\n", "* vocab_size \n", "* num_chapters_or_acts\n", "\n", "The values are not the frequencies, but the name of the **book** that has the highest value for the statistic. Make use of the **book2stats** dictionary to accomplish this." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 4 (Only required for MA)\n", "\n", "4.a) The statistics above already provide some insights, but we want to know a bit more about what the books are about. To do this, we want to get the 30 most frequent tokens of each book. Edit the function `get_basic_stats` to add one more key:value pair:\n", "* the key is **top_30_tokens**\n", "* the value is a list of the 30 most frequent tokens in the text.\n", "\n", "4.b) Use the output of the function `get_basic_stats` to write the top 30 tokens (one on each line) for each book to disk using the naming `top_30_[FILENAME]`:\n", "* top_30_AnnaKarenina.txt\n", "* top_30_HuckFinn.txt\n", "* top_30_Macbeth.txt\n", "\n", "Example of file (*the* and *and* may not be the most frequent tokens, these are just examples):\n", "\n", "```\n", "the \n", "and \n", "..\n", "```\n", "The following code snippet can help you with obtaining the top 30 occurring tokens. The goal is to call the function you updated in Exercise 4a, i.e., get_basic_stats, in the file analyze.py. This also makes it possible to write the top 30 tokens to files." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import operator \n", "\n", "token2freq = {'a': 1000, 'the': 100, 'cow' : 5}\n", "for token, freq in sorted(token2freq.items(), \n", " key=operator.itemgetter(1),\n", " reverse=True):\n", " print(token, freq)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Congratulations! You have written your first little nlp program!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" }, "vscode": { "interpreter": { "hash": "1f37b899a14b1e53256e3dbe85dea3859019f1cb8d1c44a9c4840877cfd0e7ef" } } }, "nbformat": 4, "nbformat_minor": 4 }