{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Assignment 4a: Data structures (CSV/TSV and JSON)\n", "\n", "\n", "* Please name your notebook with the following naming convention: \n", " ASSIGNMENT_4a_FIRSTNAME_LASTNAME.ipynb \n", "* Please submit your complete assignment (4a + 4b) by compressing all your material into **a single .zip file** following this naming convention: ASSIGNMENT_4_FIRSTNAME_LASTNAME.zip. \n", "\n", "## Please note that there is a BA and an MA version of Assignment 4b\n", "\n", "In case you are not sure about creating a zip file from a folder, please refer to [this guide](https://fossbytes.com/how-to-zip-file-in-windows-mac/) (or any other guide you find online).\n", "\n", "\n", "\n", "If you have **questions** about this chapter, please contact us at cltl.python.course@gmail.com. Questions and answers will be collected in [this Q&A document](https://docs.google.com/document/d/1ynQAqPa2CGB02okyyE4F1StytDqpyRoBqUpWfeBqI_Y/edit?usp=sharing), so please check if your question has already been answered. \n", "\n", "In this block, we covered the following chapters about data formats:\n", "\n", "- Chapter 16 - Data Formats I (CSV/TSV)\n", "- Chapter 17 - Data Formats II (JSON)\n", "- Chapter 18 - Data Formats III (XML) *only for master-level course*\n", "\n", "In this assignment, you will also have to apply your knowledge about containers. If you get stuck, you are likely to find solutions in the chapters about containers (Block 2). \n", "\n", "\n", "**Tip**:\n", "\n", "It could happen that your code throws a unicode error when you're trying to open one of the files used in this assignment. If this is the case, you can probably solve if by specifying the encoding when reading in the file:\n", "\n", "```python\n", "with open(your/file/path, 'r', encoding = 'utf-8') as infile:\n", " #your code\n", "```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 1: Trump's Facebook Status Updates (CSV/TSV)\n", "\n", "In the folder `../Data/csv_data` there is a TSV file called `trump_facebook.tsv` that contains Facebook status updates posted by Donald Trump. It was downloaded from [here](https://www.reddit.com/r/datasets/comments/581hqm/all_of_donald_trumps_facebook_statuses_reaction). Follow the instructions below to read the file and find specific status updates.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1a. Write your own function for reading CSV\n", "Write a function called `read_csv()` that has two parameters: \n", "\n", "* **`input_file`** (positional parameter) and \n", "* **`delimiter`** (keyword parameter with default string `\",\"`). \n", "\n", "The function should read the file and return `status_updates` which contains the content of the file as a 'list of dicts'. When tested on `../Data/Trump-Facebook/FacebookStatuses.tsv` the first two status updates should thus be represented as follows:\n", "\n", "```\n", "[{'link_name': 'Timeline Photos',\n", " 'num_angrys': '7',\n", " 'num_comments': '543',\n", " 'num_hahas': '17',\n", " 'num_likes': '6178',\n", " 'num_loves': '572',\n", " 'num_reactions': '6813',\n", " 'num_sads': '0',\n", " 'num_shares': '359',\n", " 'num_wows': '39',\n", " 'status_id': '153080620724_10157915294545725',\n", " 'status_link': 'https://www.facebook.com/DonaldTrump/photos/a.488852220724.393301.153080620724/10157915294545725/?type=3',\n", " 'status_message': 'Beautiful evening in Wisconsin- THANK YOU for your incredible support tonight! Everyone get out on November 8th - and VOTE! LETS MAKE AMERICA GREAT AGAIN! -DJT',\n", " 'status_published': '10/17/2016 20:56:51',\n", " 'status_type': 'photo'},\n", " {'link_name': '',\n", " 'num_angrys': '5211',\n", " 'num_comments': '3644',\n", " 'num_hahas': '75',\n", " 'num_likes': '26649',\n", " 'num_loves': '487',\n", " 'num_reactions': '33768',\n", " 'num_sads': '191',\n", " 'num_shares': '17653',\n", " 'num_wows': '1155',\n", " 'status_id': '153080620724_10157914483265725',\n", " 'status_link': 'https://www.facebook.com/DonaldTrump/videos/10157914483265725/',\n", " 'status_message': \"The State Department's quid pro quo scheme proves how CORRUPT our system is. Attempting to protect Crooked Hillary, NOT our American service members or national security information, is absolutely DISGRACEFUL. The American people deserve so much better. On November 8th, we will END this RIGGED system once and for all!\",\n", " 'status_published': '10/17/2016 18:00:41',\n", " 'status_type': 'video'}]\n", "```\n", "\n", "**DO NOT USE THE CSV MODULE FOR THIS EXERCISE!**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def read_csv(input_file, delimiter=\",\"):\n", " # your code here\n", "\n", "\n", "# test your function here\n", "filename = \"../Data/csv_data/trump_facebook.tsv\"\n", "status_updates = read_csv(filename, delimiter=\"\\t\") \n", "status_updates[0:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In case you didn't manage to create the `read_csv()` function, run the following code using the `DictReader()` method from the `csv` module to get the data in the right format for the following exercises:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import csv\n", "\n", "filename = \"../Data/csv_data/trump_facebook.tsv\"\n", "with open(filename, \"r\") as infile:\n", " status_updates = []\n", " csv_reader = csv.DictReader(infile, delimiter='\\t')\n", " for row in csv_reader:\n", " status_updates.append(row)\n", "status_updates[0:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1b. Find the status updates with the most responses\n", "\n", "Define a function called **`get_update_most_responded_to()`** that has the following parameters: \n", "* **`status_updates`** (positional parameter) \n", "* **`response_type`** (keyword parameter with default string `\"likes\"`) \n", "\n", "The fuction should find the status update that received the highest number of possible reactions to a Facebook status (emoji such as 'angrys', 'comments', 'hahas', etc. - anything that starts with 'num_'). It should return three strings: the **`status_message`**, the **`status_type`** and the **`status_link`** of this particular status update.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_update_most_responded_to():\n", " # your code here\n", "\n", "# test your function here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1c. Find the longest status updates\n", "\n", "Define a function called **`get_longest_update()`** that has the following parameters: \n", "* **`status_updates`** (positional parameter) \n", "* **`length_type`** (keyword parameter with default string `\"tokens\"`). \n", "\n", "The function should find the longest update. By default, the fuction should find the status update that is the longest in terms of number of tokens. Also implement the options to find the longst status update in terms of characters or sentences in the message. These options should be carried out when `length_type` is changed to `\"sentences\"` or `\"characters\"` \n", "\n", "The function should return the status message (called `'status_message'` in the data structure) of the longest update as a string. \n", "\n", "**Attention**: It is recommended to use NLTK for this exercise. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_longest_update():\n", " # your code here\n", "\n", "# test your function here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1d. Find the status updates containing specific keywords\n", "\n", "Define a function called **`get_updates_with_keywords()`** that takes three input arguments: \n", "\n", "* **`status_updates`** (mandatory positional argument) \n", "* **`keywords`** (mandatory positional argument) \n", "* **`case_sensitive`** (keyword argument with default `False`)\n", "\n", "The fuction should find the status updates that contain **any of the keywords**. The parameter `case_sensitive` should specify whether uppercase and lowercase characters must be treated as distinct. \n", "\n", "The function should return **`filtered_status_updates`**, which is a list of dictioaries with all information about the status updates (same format as the input argument `'status_updates'`). \n", "\n", "**Attention**: It is highly recommended to use NLTK for this exercise. Make sure that you **tokenize** the messages before you look for keywords. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_update_with_keywords():\n", " # your code here\n", "\n", "keywords = [\"clinton\", \"obama\"] # test with these keywords; also experiment with other keywords\n", "# test your function here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 2: Nobel Prize Winners (JSON)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a lot of interesting data online. For example, the [Nobel Prize Organisaton](https://www.nobelprize.org) provides the [Nobel Prize API](https://nobelprize.readme.io) that allows you to download information about the prizes, the laureates and the countries. \n", "\n", "The information is formatted in JSON. Have a look at the following URLs:\n", "- http://api.nobelprize.org/v1/prize.json\n", "- http://api.nobelprize.org/v1/laureate.json\n", "- http://api.nobelprize.org/v1/country.json\n", "\n", "For this exercise, we will only look at the prizes and the laureates. \n", "\n", "We can download the data using the `requests` module. How this works is shown below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Download data on prizes\n", "api_url = \"http://api.nobelprize.org/v1/prize.json\"\n", "r = requests.get(api_url)\n", "dict_prizes = r.json()\n", "# uncomment the line below if you'd like to see what's inside dict_prizes\n", "#dict_prizes " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Download data on laureates\n", "api_url = \"http://api.nobelprize.org/v1/laureate.json\"\n", "r = requests.get(api_url)\n", "dict_laureates = r.json()\n", "# uncomment the line below if you'd like to see what's inside dict_prizes\n", "#dict_laureates " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2a. Read the JSON files\n", "\n", "We have already stored the data as the JSON files `laureate.json` and `prize.json` in the folder `../Data/json_data/NobelPrize`. Open these JSON files and load them as the Python dictionaries `dict_laureates` and `dict_prizes`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load laureates.json and prize.json here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2b. Get all laureates from year and category\n", "\n", "Create a function called **`get_laureates()`** that thas three parameters: \n", "\n", "* **`dict_prizes`** (positional parameter) \n", "* **`year`** (keyword parameter with default `None`) \n", "* **`category`** (keyword parameter with default `None`) \n", "\n", "The function should find all laureates that received the Nobel Prize, optionally in a specific year and/or category (specified using the keywords `year` and `category`). It should return a list of the full names of the laureates.\n", "\n", "For example, for the year 2018 and category \"peace\" it should return the list `['Denis Mukwege', 'Nadia Murad']`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_laureates():\n", " # your code here \n", "\n", "\n", "year = 2018\n", "category = \"peace\"\n", "# test your function here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2c. Get all prizes from affiliations\n", "\n", "Create a function called **`get_affiliation_prizes()`** that takes one input parameters: \n", "\n", "* **`dict_laureates`** (positional parameter) \n", "\n", "The function should find all affiliations that were involved in winning the Nobel Prize and provide information on the category and year of those Nobel Prizes. It should return a nested dictionary of the following format:\n", "\n", "```\n", "{\n", " \"A.F. Ioffe Physico-Technical Institute\": [\n", " {\"category\": \"physics\", \"year\": \"2000\"}\n", " ],\n", " \"Aarhus University\": [\n", " {\"category\": \"chemistry\", \"year\": \"1997\"},\n", " {\"category\": \"economics\",\"year\": \"2010\"}\n", " ]\n", "}\n", "```\n", "\n", "**Tip:** some of the entries will lack information (for example, there is no associated affiliation). Use `if-statements` to check if essential information is present. \n", "\n", "**General tip for working with data**: If your code breaks, check whether your assumptions about the data hold (very often, they unfortunatelydo not). For instance, a dictionary key you thought was always present is missing from a couple of dictionaries, etc. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_affiliation_prizes():\n", " # your code here\n", "\n", "# test your function here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2d. Write to JSON\n", "\n", "Next, write the dictionary created in the previous exercise to a JSON file using the following path: \n", "\n", "`../Data/json_data/NobelPrize/nobel_prizes_affiliations.json`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# write the resulting dictionary to 'json_file'" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 4 }