{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Assignment 3a\n", "\n", "**IMPORTANT NOTE**:\n", "* The students who follow the Bachelor version of this course, i.e., the course Introduction to Python for Humanities and Social Sciences (L_AABAALG075) as part of the minor Digital Humanities, do **NOT have to do Exercises 3 and 4 of Assignment 3b**\n", "* The other students, who follow the Master version of Programming in Python for Text Analysis (L_AAMPLIN021), are required to **DO Exercises 3 and 4 of Assignment 3b**\n", "\n", "In this block, we covered a lot of ground:\n", "\n", "* Chapter 12 - Importing external modules \n", "* Chapter 13 - Working with Python scripts\n", "* Chapter 14 - Reading and writing text files\n", "* Chapter 15 - Off to analyzing text \n", "\n", "\n", "In this assignment, you will first complete a number of small exercises about each chapter to make sure you are familiar with the most important concepts. In the second part of the assignment, you will apply your newly acquired skills to write your very own text processing program (ASSIGNMENT-3b) :-). But don't worry, there will be instructions and hints along the way. \n", "\n", "\n", "**Can I use external modules other than the ones treated so far?**\n", "\n", "For now, please try to avoid it. All the exercises can be solved with what we have covered in block I, II, and III. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Functions & scope" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Excercise 1:\n", "\n", "Define a function called `split_sort_text` which takes one positional parameter called **text** (a string).\n", "\n", "The function:\n", "* splits the string on a space character, i.e., ' ' [not all whitespace!]\n", "* returns all the unique words in alphabetical order as a list.\n", "\n", "* Hint 1: There is a specific python container which does not allow for duplicates and simply removes them. Use this one. \n", "* Hint 2: There is a built-in function which sorts items in an iterable called 'sorted'. Look at the documentation to see how it is used. \n", "* Hint 3: Don't forget to write a docstring. Please make sure that the docstring generally explains with the input is, what the function does, and what the function returns. If you want, but this is not needed to receive full points, you can use [reStructuredText](http://docutils.sourceforge.net/rst.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Working with external modules" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 2\n", "NLTK offers a way of using WordNet in Python. Do some research (using google, because quite frankly, that's what we do very often) and see if you can find out how to import it. WordNet is a computational lexicon which organizes words according to their senses (collected in synsets). 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Working with external modules" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 2\n", "NLTK offers a way of using WordNet in Python. Do some research (using Google, because quite frankly, that is what we do very often) and see if you can find out how to import it. WordNet is a computational lexicon which organizes words according to their senses (collected in synsets). See if you can print all the **synset definitions** of the lemma **dog**.\n", "\n", "Run the following cell first to make sure WordNet is installed:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import nltk\n", "# uncomment the following lines to download material including WordNet\n", "# nltk.download('book')\n", "# nltk.download('omw-1.4')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Working with Python scripts" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 3\n", "\n", "#### a.) Define a function called `my_word_count`, which determines how often each word occurs in a string and returns the result as a Python dictionary. Do not use NLTK just yet. Find a way to test it. \n", "\n", "* Write a helper function called `preprocess`, which removes the punctuation specified by the user and returns the same string without the unwanted characters. You should call the function `preprocess` inside the `my_word_count` function.\n", "\n", "* Remember that there are string methods that you can use to get rid of unwanted characters. Test the `preprocess` function on the string `'this is a (tricky) test'` by removing the opening and closing parentheses.\n", "\n", "* Remember how we used dictionaries to count words? If not, have a look at Chapter 10 - Dictionaries; a short reminder of the counting pattern is sketched below. \n", "\n", "* Make sure you split the string on a space character ' '. You can then loop over the list to count the words.\n", "\n", "* Test your function using an example string, which will tell you whether it fulfills the requirements (remove punctuation, split, count).\n", "\n", "#### b.) Create a Python script \n", "\n", "Use your editor to create a Python script called **count_words.py**. Place the function definition of **my_word_count** in **count_words.py**. Also call **my_word_count** in this file to test it, and print the results (you can choose the arguments of this call). Place your helper function definition, i.e., **preprocess**, in a separate script called **utils_3a.py**, and import **preprocess** into count_words.py. Test whether everything works as expected by calling the script count_words.py from the terminal.\n", "\n", "The function **preprocess** preprocesses the text by removing the characters that are unwanted by the user. **preprocess** is called within **my_word_count**. The function **my_word_count** uses the output of **preprocess** and creates a dictionary in which each key is a word and the corresponding value is the frequency of that word.\n", "\n", "**Please submit these scripts together with the notebooks**.\n", "\n", "Don't forget to add docstrings to your functions. " ] },
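{ "cell_type": "markdown", "metadata": {}, "source": [ "As a reminder, the next cell sketches the dictionary counting pattern from Chapter 10. It is illustrative only, not the solution to this exercise; the example string and variable names are made up." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Reminder of the Chapter 10 counting pattern (not the full solution;\n", "# the example string and variable names are made up).\n", "example = 'the cat saw the dog'\n", "counts = {}\n", "for word in example.split(' '):\n", "    if word not in counts:\n", "        counts[word] = 0\n", "    counts[word] += 1\n", "print(counts)  # {'the': 2, 'cat': 1, 'saw': 1, 'dog': 1}" ] },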
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Feel free to use this cell to try out your code. " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Dealing with text files" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 4\n", "\n", "**Playing with lyrics**\n", "\n", "a.) Write a function called `load_text`, which opens and reads a file and returns the text in the file. It should have the file path as a parameter. Test it by loading this file: ../Data/lyrics/walrus.txt\n", "\n", "* Hint: remember that it is best practice to use a context manager.\n", "* Hint: a **FileNotFoundError** means that the path you provided does not lead to an existing file on your computer. Carefully study Chapter 14 and determine where the notebook or Python module you are working with is located on your computer. Then try to determine where Python looks when you provide a relative path such as '../Data/lyrics/walrus.txt', and follow that path from your notebook to the location where Python is trying to find the file. One tip: if you did not store the assignment notebooks 3a and 3b in the folder 'Assignments', you will get this error.\n", "\n", "b.) Write a function called `replace_walrus`, which takes lyrics as input and replaces every instance of 'walrus' with 'hippo' (make sure to account for upper and lower case - it is fine to transform everything to lower case). The function should write the new version of the song to a file called 'walrus_hippo.txt', stored in ../Data/lyrics. \n", "\n", "Don't forget to add docstrings to your functions. " ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Analyzing text with NLTK" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 5\n", "\n", "**Building a simple NLP pipeline**\n", "\n", "For this exercise, you will need NLTK. Don't forget to import it. \n", "\n", "Write a function called `tag_text`, which takes raw text as input and returns the tagged text. To do this, make sure you follow the steps below:\n", "\n", "* Tokenize the text. \n", "\n", "* Perform part-of-speech tagging on the list of tokens. \n", "\n", "* Return the tagged text.\n", "\n", "\n", "Then test your function using the text snippet below (`test_text`) as input.\n", "\n", "Please note that some of the tags may be incorrect; this is not a mistake on your end, but simply a consequence of NLP tools not being perfect." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_text = \"\"\"Shall I compare thee to a summer's day?\n", "Thou art more lovely and more temperate:\n", "Rough winds do shake the darling buds of May,\n", "And summer's lease hath all too short a date:\"\"\"" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Python knowledge" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 6\n", "\n", "6.a) Explain in your own words the difference between the global and the local scope." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "[answer]" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "6.b) What is the difference between the modes 'w' and 'a' when opening a file?" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "[answer]" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" }, "vscode": { "interpreter": { "hash": "1f37b899a14b1e53256e3dbe85dea3859019f1cb8d1c44a9c4840877cfd0e7ef" } } }, "nbformat": 4, "nbformat_minor": 4 }