{ "cells": [ { "cell_type": "markdown", "source": [ "# Chapter 17: Data formats (JSON)\n", "\n", "Let's have a look at another data format: JSON (JavaScript Object Notation). JSON is a lightweight data-interchange format that is easy for humans to read and write, and easy for machines to parse and generate. There is a good chance that at some point you will be gathering information through an API that is formatted in JSON (e.g. Twitter data). \n", "\n", "\n", "### At the end of this chapter, you will be able to:\n", "\n", "* read and write JSON dictionary files\n", "* deal with multiple layers of nesting within JSON dictionary structures\n", "\n", "### If you want to learn more about these topics, you might find the following links useful:\n", "* [Video: Working With JSON](https://www.youtube.com/watch?v=Kf0q4Tf5M3c)\n", "* [Tutorial: Working With JSON Data in Python](https://realpython.com/python-json/)\n", "\n", "If you have **questions** about this chapter, please contact us at cltl.python.course@gmail.com." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## 1. Introduction to JSON\n", "\n", "JSON is completely language independent. However, data formatted in JSON works just like a Python dictionary! \n", "\n", "![box](./images/python_json_conversion_table.png)\n", "\n", "\n", "We will show how JSON looks like and how to deal with JSON in Python with the example dictionary shown below. " ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "dict_doe_family = { \n", " \"John\": {\n", " \"first name\": \"John\", \n", " \"last name\": \"Doe\", \n", " \"gender\": \"male\", \n", " \"age\": 30, \n", " \"favorite_animal\": \"panda\",\n", " \"married\": True,\n", " \"children\": [\"James\", \"Jennifer\"],\n", " \"hobbies\": [\"photography\", \"sky diving\", \"reading\"]},\n", " \"Jane\": {\n", " \"first name\": \"Jane\", \n", " \"last name\": \"Doe\", \n", " \"gender\": \"female\", \n", " \"age\": 27, \n", " \"favorite_animal\": \"zebra\",\n", " \"married\": False,\n", " \"children\": None,\n", " \"hobbies\": [\"cooking\", \"gaming\", \"tennis\"]}}" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "You can inspect any of the JSON files that we will generate or load below using a text editor (e.g. [Atom](https://atom.io/), [BBEdit](https://www.barebones.com/products/bbedit/download.html) or [Notepad++](https://notepad-plus-plus.org))." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## 2. Working with JSON in Python\n", "\n", "You could read in a JSON file just like any other text file:" ], "metadata": {} }, { "cell_type": "code", "execution_count": 3, "source": [ "with open('../Data/json_data/fruits.json') as infile:\n", " text = infile.read()\n", "print(text)" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "{\n", " \"fruit\": \"Apple\",\n", " \"size\": \"Large\",\n", " \"color\": \"Red\"\n", "}\n", "\n" ] } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "However, since it is structured to correspond well to Python objects, we use an existing mudule, the [**json**](https://docs.python.org/3/library/json.html) library, which provides an easy way to encode and decode data in JSON. Let's first import it:" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "import json" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "We will focus on the following methods:\n", "* **to read JSON**: the methods `json.load()` and `json.loads()`\n", "* **to write JSON**: the methods `json.dump()` and `json.dumps()`\n", "\n", "The functions with an **s** take string arguments." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "### 2.1 Loading JSON from file or string\n", "The `load()` method is used to load a JSON encoded file as a Python dictionary:" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# open file just as you would open a txt file\n", "with open(\"../Data/json_data/Doe.json\", \"r\") as infile:\n", " # read in file content as dict using the json module\n", " dict_doe_family = json.load(infile)\n", " print(type(dict_doe_family))\n", " print(dict_doe_family)" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "The `loads()` method is used to load a JSON formatted string as a Python dictionary. This is useful if you want to create a json dictionary from a string. There are not many situations in which you will have to use it. You may come accross json structures stored as strings in when working with corpora or collecting human annotations with annotation software. \n" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "str_doe_family = \"\"\"\n", "{\n", " \"Jane\": {\n", " \"age\": 27,\n", " \"children\": null,\n", " \"favorite_animal\": \"zebra\",\n", " \"first name\": \"Jane\",\n", " \"gender\": \"female\",\n", " \"hobbies\": [\n", " \"cooking\",\n", " \"gaming\",\n", " \"tennis\"\n", " ],\n", " \"last name\": \"Doe\",\n", " \"married\": false\n", " },\n", " \"John\": {\n", " \"age\": 30,\n", " \"children\": [\n", " \"James\",\n", " \"Jennifer\"\n", " ],\n", " \"favorite_animal\": \"panda\",\n", " \"first name\": \"John\",\n", " \"gender\": \"male\",\n", " \"hobbies\": [\n", " \"photography\",\n", " \"sky diving\",\n", " \"reading\"\n", " ],\n", " \"last name\": \"Doe\",\n", " \"married\": true\n", " }\n", "}\n", "\"\"\"\n", "\n", "dict_doe_family = json.loads(str_doe_family)\n", "print(type(dict_doe_family))\n", "print(dict_doe_family)" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "### 2.2 Writing JSON to file or string\n", "Let's first define our Python dictionary again:" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "dict_doe_family = { \n", " \"John\": {\n", " \"first name\": \"John\", \n", " \"last name\": \"Doe\", \n", " \"gender\": \"male\", \n", " \"age\": 30, \n", " \"favorite_animal\": \"panda\",\n", " \"married\": True,\n", " \"children\": [\"James\", \"Jennifer\"],\n", " \"hobbies\": [\"photography\", \"sky diving\", \"reading\"]},\n", " \"Jane\": {\n", " \"first name\": \"Jane\", \n", " \"last name\": \"Doe\", \n", " \"gender\": \"female\", \n", " \"age\": 27, \n", " \"favorite_animal\": \"zebra\",\n", " \"married\": False,\n", " \"children\": None,\n", " \"hobbies\": [\"cooking\", \"gaming\", \"tennis\"]}}" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "The **`json.dump()`** method is used to write a Python dictionary to a JSON encoded file:" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "with open(\"../Data/json_data/Doe.json\", \"w\") as outfile:\n", " json.dump(dict_doe_family, outfile)" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "The **`dumps()`** method is used to convert a Python dictionary to a JSON formatted string:" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "str_doe_family = json.dumps(dict_doe_family)\n", "print(str_doe_family)" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "Both `dump()` and `dumps()` use the same keyword parameters. You can check them out with `help()`:\n", "\n", "**Hint**: There are ways of formating the dictionary nicely. Can you find out how it works from reading the documentation?" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "help(json.dumps)\n", "#help(json.dump)" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "Two useful keyword arguments are for example `indent` and `sort_keys`. They are illustrated below:" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# Create the JSON file\n", "with open(\"../Data/json_data/Doe.json\", \"w\") as outfile:\n", " json.dump(dict_doe_family, \n", " outfile, \n", " indent=4, \n", " sort_keys=True)\n", "\n", "# Read in the JSON file again\n", "with open(\"../Data/json_data/Doe.json\", \"r\") as infile:\n", " json_string = infile.read()\n", " print(json_string)" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "str_doe_family = json.dumps(dict_doe_family, \n", " indent=4, \n", " sort_keys=True)\n", "print(str_doe_family)" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "## 3. Accessing data in a json dictionary\n", "\n", "Note that the values in this dictionary can be containers themselves: Each key has another dictionary as a value. The keys 'children' and 'hobbies' have lists as values. Note that we can look at such a dictionary in terms of **layers of nesting**.\n", "\n", "\n", "Consider the dict below. How many layers (i.e. containers within containers) can you identify?" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "dict_doe_family = { \n", " \"John\": {\n", " \"first name\": \"John\", \n", " \"last name\": \"Doe\", \n", " \"gender\": \"male\", \n", " \"age\": 30, \n", " \"favorite_animal\": \"panda\",\n", " \"married\": True,\n", " \"children\": [\"James\", \"Jennifer\"],\n", " \"hobbies\": [\"photography\", \"sky diving\", \"reading\"]},\n", " \"Jane\": {\n", " \"first name\": \"Jane\", \n", " \"last name\": \"Doe\", \n", " \"gender\": \"female\", \n", " \"age\": 27, \n", " \"favorite_animal\": \"zebra\",\n", " \"married\": False,\n", " \"children\": None,\n", " \"hobbies\": [\"cooking\", \"gaming\", \"tennis\"]}}" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "Often, we will need to extract information from such a nested structure. To do this, it is helpful to remember what you have learned in Block II about containers and looping. \n", "\n", "Let's say we want to extract the hobbies of Doe family members. How could we approach this? There are various ways of working your way through the layers. \n", "\n", "For instance, you can use dictionary keys:" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# access information about John\n", "john_info = dict_doe_family['John']\n", "print(john_info)\n", "# access information about John's hobbies:\n", "john_hobbies = john_info['hobbies']\n", "print(john_hobbies)\n", "\n", "# You can also do this in one go:\n", "john_hobbies = dict_doe_family['John']['hobbies']\n" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "Now let's extract the hobbies of all family members. Can you finish the code below?\n", "\n", "**Hint**: If you are confused about how to access elements in the structure, the type() function is your friend! If you are having a hard time remembering how to deal with the different containers, please have a look at the chapters about containers. Also, remember that you can always get a list of all methods of a container by using the dir() function (+ help() to see what they are doing). " ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# iterate over family dict by accessing \n", "#the family members (keys) and their information (values):\n", "\n", "all_hobbies = []\n", "for member, info_dict in dict_doe_family.items():\n", " # check what we are accessing:\n", " print(member, type(info_dict))\n", " # access hobbies from info_dict\n", " hobbies = info_dict['hobbies']\n", " print(hobbies)\n", " # your code here\n", " " ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Exercises" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "### Exercise 1:\n", "What is the difference between `str()` and `json.dumps()` to convert a Python dictionary to a string? Try it out below." ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "dict_doe_family = { \n", " \"John\": {\n", " \"first name\": \"John\", \n", " \"last name\": \"Doe\", \n", " \"gender\": \"male\", \n", " \"age\": 30, \n", " \"favorite_animal\": \"panda\",\n", " \"married\": True,\n", " \"children\": [\"James\", \"Jennifer\"],\n", " \"hobbies\": [\"photography\", \"sky diving\", \"reading\"]},\n", " \"Jane\": {\n", " \"first name\": \"Jane\", \n", " \"last name\": \"Doe\", \n", " \"gender\": \"female\", \n", " \"age\": 27, \n", " \"favorite_animal\": \"zebra\",\n", " \"married\": False,\n", " \"children\": None,\n", " \"hobbies\": [\"cooking\", \"gaming\", \"tennis\"]}}" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "str_doe_family = json.dumps(dict_doe_family)\n", "print(str_doe_family)" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "str_doe_family = str(dict_doe_family)\n", "print(str_doe_family)" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "### Exercise 2:\n", "Working with JSON data thus means working with dictionaries. Let's practice a bit more with accessing the values of dictionaries by printing the following:\n", "* all (parent) names\n", "* the age of Jane\n", "* the first hobby of John\n", "* the favorite animal of each person (use a for-loop)" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# Example: This will print the gender of \"John\"\n", "print(dict_doe_family[\"John\"][\"gender\"])" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "### Exercise 3:\n", "Please add the following entry to `dict_doe_family`. Write the result to a JSON file." ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "julia = {\"Julia\": {\"first name\": \"Julia\", \n", " \"last name\": \"Doe\",\n", " \"age\": 29, \n", " \"favorite_animal\": \"penguin\",\n", " \"married\": False,\n", " \"children\": [\"Jack\"],\n", " \"hobbies\": [\"snowboarding\", \"hiking\"]}}\n", "\n", "# your code here" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "### Exercise 4:\n", "Let's have a look at another example. The JSON file `../Data/json_data/StrangerThings.json` contains information about episodes from the TV show Stranger Things. Read the file and store the dictionary as `tv_show`. " ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "tv_show = # your code here" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "To help you understand the structure a bit, first have a look at the following examples:" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "print(tv_show.keys())" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "print(tv_show[\"_embedded\"].keys())" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "print(tv_show[\"_embedded\"][\"episodes\"])" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "Print the following information about the TV show:\n", "* the language \n", "* the summary \n", "* the genres \n", "* the average rating\n", "* the url of the original image of the TV show" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "\n" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "Print the following information about its episodes:\n", "* the number of episodes\n", "* all urls of the episodes\n", "* the summary of the 5th episode (Chapter Five) of season 1\n", "* the url of the original image of the episode with id 578664\n", "* all names and season numbers of the episodes where \"Nancy\" is mentioned in the summary" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [], "outputs": [], "metadata": {} } ], "metadata": { "kernelspec": { "name": "python3", "display_name": "Python 3.8.10 64-bit ('nlp': conda)" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "interpreter": { "hash": "6106edc083458b68f61c14c570e0f5152b4e1e25a61780539c3fe413e38ae5e6" } }, "nbformat": 4, "nbformat_minor": 4 }