{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Working with JSON and XML files in Python\n", "\n", "This tutorial is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.\n", "\n", "## Lab Goals\n", "\n", "## Acknowledgements\n", "\n", "Information and exercises in this lab are adapted from:\n", "- Al Sweigart, \"Chapter 16, Working with CSV Files and JSON Data\" in [*Automate the Boring Stuff With Python*](https://nostarch.com/automatestuff2) (No Starch Press, 2020): 371-388.\n", "- Wes McKinney, \"Chapter 6.1, Reading and Writing Data in Text Format\" in [*Python for Data Analysis*](https://www.oreilly.com/library/view/python-for-data/9781491957653/) (O'Reilly, 2017): 169-184.\n", "- Charles Severance, \"Chapter 13, Using Web Services\" in [*Python for Everybody*](https://www.py4e.com/book.php) (Charles Severance, 2009): 155-170.\n", "\n", "The XML portions of this lab are adapted from the \"Project 4: XML and XSLT\" project materials developed by [Lindsay K. Mattock](http://lindsaymattock.net/) for the [SLIS 5020 Computing Foundations course](http://lindsaymattock.net/computingfoundations.html). 
\n", "\n", "# Table of Contents\n", "\n", "- [Data](#data)\n", "- [JSON](#json)\n", " * [What is JSON and why are we learning about it](#what-is-json-and-why-are-we-learning-about-it)\n", " * [Reading JSON into Python](#reading-json-into-python)\n", " * [Working with JSON in Python](#working-with-json-in-python)\n", " * [Writing to JSON from Python](#writing-to-json-from-python)\n", " * [JSON Project Prompt](#json-project-prompt)\n", "- [XML](#xml)\n", " * [What is XML and why are we learning about it](#what-is-xml-and-why-are-we-learning-about-it)\n", " * [XML Versus HTML](#xml-versus-html)\n", " * [XML Example 1](#xml-example-1)\n", " * [XML Example 2](#xml-example-2)\n", " * [Reading XML Into Python](#reading-xml-into-python)\n", " * [Parsing XML in Python](#parsing-xml-in-python)\n", " * [Working With XML in Python](#working-with-xml-in-python)\n", " * [Writing to XML from Python](#writing-to-xml-from-python)\n", " * [XML Project Prompt](#xml-project-prompt)\n", "- [Lab Notebook Questions](#lab-notebook-questions)\n", "\n", "# Data\n", "\n", "The only data needed for this lab is the `books.xml` file, which can be downloaded from this GitHub repo.\n", "\n", "[Link to Google Drive access (ND users only)](https://drive.google.com/drive/folders/1fa78Av2rELSuk2Nhj1DR8yErJeZQRNbN?usp=sharing)\n", "\n", "# JSON\n", "\n", "## What is JSON and why are we learning about it\n", "\n", "1. JavaScript Object Notation (JSON) is a popular way to format data as a single (purportedly human-readable) string. \n", "\n", "2. JavaScript programs use JSON data structures, but we frequently encounter JSON data outside of a JavaScript environment.\n", "\n", "3. Websites that make machine-readable data available via an application programming interface (API; more on these in an upcoming lab) often provide that data in a JSON format. Examples include Twitter, Wikipedia, Data.gov, etc. Most live data connections available via an API are provided in a JSON format.\n", "\n", "4. 
JSON structure can vary WIDELY depending on the specific data provider, but this lab will cover some basic elements of working with JSON in Python.\n", "\n", "5. The easiest way to think of JSON data is as a plain-text data format made up of key-value pairs, like those we've encountered previously in working with dictionaries.\n", "\n", "6. Example JSON string: `stringOfJsonData = '{\"name\": \"Zophie\", \"isCat\": true, \"miceCaught\": 0, \"felineIQ\": null}'`\n", "\n", "7. From looking at the example string, we can see field names or keys (`name`, `isCat`, `miceCaught`, `felineIQ`) and values for those fields.\n", "\n", "8. To use more precise terminology, JSON data has the following attributes:\n", "- uses name/value pairs\n", "- separates data using commas\n", "- holds objects using curly braces `{}`\n", "- holds arrays using square brackets `[]`\n", "\n", "9. In our example `stringOfJsonData`, we have an object contained in curly braces. \n", "\n", "10. An object can include multiple name/value pairs. Multiple objects together can form an array.\n", "\n", "11. Values stored in JSON format must be one of the following data types:\n", "- string\n", "- number\n", "- object (JSON object)\n", "- array\n", "- boolean\n", "- null\n", "\n", "12. How is data stored in a JSON format different from data stored in a CSV? \n", "\n", "13. A `.csv` file uses characters such as commas as delimiters and has more of a tabular (table-like) structure.\n", "\n", "14. JSON data uses characters as part of the syntax, but not in the same way as delimited data files. \n", "\n", "15. Additionally, the data stored in a JSON format has values that are attached to names (or keys).\n", "\n", "16. JSON can also have a hierarchical or nested structure, in that objects and arrays can be nested inside other objects and arrays.\n", "\n", "17. 
For example, take a look at sample JSON data from Twitter's API:\n", "```JSON\n", "{\n", " \"created_at\": \"Thu Apr 06 15:24:15 +0000 2017\",\n", " \"id_str\": \"850006245121695744\",\n", " \"text\": \"1\\/ Today we\\u2019re sharing our vision for the future of the Twitter API platform!\\nhttps:\\/\\/t.co\\/XweGngmxlP\",\n", " \"user\": {\n", " \"id\": 2244994945,\n", " \"name\": \"Twitter Dev\",\n", " \"screen_name\": \"TwitterDev\",\n", " \"location\": \"Internet\",\n", " \"url\": \"https:\\/\\/dev.twitter.com\\/\",\n", " \"description\": \"Your official source for Twitter Platform news, updates & events. Need technical help? Visit https:\\/\\/twittercommunity.com\\/ \\u2328\\ufe0f #TapIntoTwitter\"\n", " },\n", " \"place\": { \n", " },\n", " \"entities\": {\n", " \"hashtags\": [ \n", " ],\n", " \"urls\": [\n", " {\n", " \"url\": \"https:\\/\\/t.co\\/XweGngmxlP\",\n", " \"unwound\": {\n", " \"url\": \"https:\\/\\/cards.twitter.com\\/cards\\/18ce53wgo4h\\/3xo1c\",\n", " \"title\": \"Building the Future of the Twitter API Platform\"\n", " }\n", " }\n", " ],\n", " \"user_mentions\": [ \n", " ]\n", " }\n", "}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Q1: Decipher what we're seeing in the JSON here. What are the name/value pairs, and how are they organized in this object?
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading JSON into Python\n", "\n", "18. We can read JSON into Python using the `json` module.\n", "\n", "
To learn more, see the Python documentation for the json module.
\n", "\n", "19. The `json.loads()` and `json.dumps()` functions translate JSON data and Python values.\n", "\n", "20. Translation table:\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| JSON | Python |
| --- | --- |
| object | dict |
| array | list |
| string | str |
| number (int) | int |
| number (real) | float |
| true | True |
| false | False |
| null | None |
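
The translation table can be checked with a short round trip. This is a minimal sketch, not part of the original lab: the dictionary `python_value` and its keys are made up for illustration, and only `json.dumps()` and `json.loads()` from the standard library are used.

```python
import json

# Hypothetical sample value with one entry per row of the translation table
python_value = {
    'an_object': {'nested': 'dict'},
    'an_array': [1, 2, 3],
    'a_string': 'text',
    'an_int': 7,
    'a_real': 2.5,
    'yes': True,
    'no': False,
    'nothing': None,
}

# Python -> JSON string: True/False/None are written as true/false/null
json_string = json.dumps(python_value)
print(json_string)

# JSON string -> Python: each value comes back as the matching Python type
round_trip = json.loads(json_string)
for name, value in round_trip.items():
    print(name, type(value).__name__)
```

Printing the type names next to each key makes it easy to compare the result against the table row by row.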
\n", "\n", "21. To translate a string of JSON data into a Python value, we pass it to the `json.loads()` function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# import json module\n", "import json\n", "\n", "# string of JSON data\n", "stringOfJsonData = '{\"name\": \"Zophie\", \"isCat\": true, \"miceCaught\": 0, \"felineIQ\": null}'\n", "\n", "# load JSON data as Python value \n", "jsonDataAsPythonValue = json.loads(stringOfJsonData)\n", "\n", "# output JSON string as Python value\n", "jsonDataAsPythonValue" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "22. This block of code imports the `json` module and passes a string of JSON data to the `json.loads()` function, which returns the equivalent Python value.\n", "\n", "23. NOTE: JSON strings always use double quotes. A JSON object is rendered in Python as a dictionary. Since Python 3.7, dictionaries preserve insertion order, so the keys appear in the same order as in the original JSON string; in older versions of Python the order may not match.\n", "\n", "## Working with JSON in Python\n", "\n", "24. Now that the JSON data is stored as a dictionary in Python, we can interact with it using the functionality available for Python dictionaries.\n", "\n", "25. We could get all of the keys in the dictionary using the `keys()` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# import json module\n", "import json\n", "\n", "# string of JSON data\n", "stringOfJsonData = '{\"name\": \"Zophie\", \"isCat\": true, \"miceCaught\": 0, \"felineIQ\": null}'\n", "\n", "# load JSON data as Python value \n", "jsonDataAsPythonValue = json.loads(stringOfJsonData)\n", "\n", "# print list of keys\n", "print(jsonDataAsPythonValue.keys())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "26. We could get all of the values in the dictionary using the `values()` method." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# import json module\n", "import json\n", "\n", "# string of JSON data\n", "stringOfJsonData = '{\"name\": \"Zophie\", \"isCat\": true, \"miceCaught\": 0, \"felineIQ\": null}'\n", "\n", "# load JSON data as Python value \n", "jsonDataAsPythonValue = json.loads(stringOfJsonData)\n", "\n", "# print list of values\n", "print(jsonDataAsPythonValue.values())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "27. We could iterate by keys over the items in the dictionary." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# import json module\n", "import json\n", "\n", "# string of JSON data\n", "stringOfJsonData = '{\"name\": \"Zophie\", \"isCat\": true, \"miceCaught\": 0, \"felineIQ\": null}'\n", "\n", "# load JSON data as Python value \n", "jsonDataAsPythonValue = json.loads(stringOfJsonData)\n", "\n", "# iterate by keys using for loop\n", "for key in jsonDataAsPythonValue.keys():\n", " print(key, jsonDataAsPythonValue[key])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "28. We could also iterate over items in dictionary using key-value pairs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# import json module\n", "import json\n", "\n", "# string of JSON data\n", "stringOfJsonData = '{\"name\": \"Zophie\", \"isCat\": true, \"miceCaught\": 0, \"felineIQ\": null}'\n", "\n", "# load JSON data as Python value \n", "jsonDataAsPythonValue = json.loads(stringOfJsonData)\n", "\n", "# iterate by key value pairs using for loop\n", "for key, value in jsonDataAsPythonValue.items():\n", " print(key, value)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "29. We can read the value for a particular key using the index operator. The command `jsonDataAsPythonValue['name']` will return `Zophie`.\n", "\n", "30. 
In situations where JSON data includes nested or hierarchical objects and arrays, we will end up with nested lists and dictionaries in Python (for example, a list of dictionaries).\n", "\n", "31. For example, let's say we have a different JSON example and want to use more complex expressions in Python." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# import json module\n", "import json\n", "\n", "# new json data\n", "data = '''\n", "[\n", " { \"id\" : \"001\",\n", " \"x\" : \"2\",\n", " \"name\" : \"Chuck\"\n", " } ,\n", " { \"id\" : \"009\",\n", " \"x\" : \"7\",\n", " \"name\" : \"Brent\"\n", " }\n", "]'''\n", "\n", "# load JSON data as a Python list of dictionaries\n", "info = json.loads(data)\n", "\n", "# print number of users\n", "print('User Count:', len(info))\n", "\n", "# use for loop to print list of names, IDs, and attributes\n", "for item in info:\n", " print('Name', item['name'])\n", " print('Id', item['id'])\n", " print('Attribute', item['x'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "32. For more on working with dictionaries in Python:\n", "- [Elements of Computing I lab](https://github.com/kwaldenphd/python-lab6/blob/master/README.md#working-with-dictionaries)\n", "- [W3 Schools tutorial](https://www.w3schools.com/python/python_dictionaries.asp)\n", "\n", "## Writing to JSON from Python\n", "\n", "33. The `json.dumps()` function will translate a Python dictionary into a string of JSON-formatted data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# import json module\n", "import json\n", "\n", "# Python dictionary\n", "pythonValue = {'isCat': True, 'miceCaught': 0, 'name': 'Zophie', 'felineIQ': None}\n", "\n", "# translate Python value to JSON string\n", "stringOfJsonData = json.dumps(pythonValue)\n", "\n", "stringOfJsonData" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "34. We can also write the data in a Python dictionary directly to a JSON file using `json.dump()` (note that this function name has no `s`)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# import json module\n", "import json\n", "\n", "# Python dictionary\n", "pythonValue = {'isCat': True, 'miceCaught': 0, 'name': 'Zophie', 'felineIQ': None}\n", "\n", "# create new JSON file and write dictionary to file\n", "with open('output.json', 'w') as json_file:\n", " json.dump(pythonValue, json_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "35. Later in the semester we will talk about how to read JSON data into Python and convert it to a tabular data structure (called a data frame in Python), using a library called `pandas`. Stay tuned!\n", "\n", "## JSON Project Prompt\n", "\n", "36. Navigate to an open data portal and download a JSON file. \n", "\n", "37. Some options that can get you started:\n", "- [Data.gov](https://www.data.gov/)\n", "- [City of Chicago Data Portal](https://data.cityofchicago.org/)\n", "- [City of South Bend Open Data](https://data-southbend.opendata.arcgis.com/)\n", "\n", "38. Open the data in a spreadsheet program and/or text editor. \n", "\n", "39. Describe what you are seeing. How can we start to make sense of this data? What documentation is available?\n", "\n", "40. Read the JSON data into Python and convert to a Python value.\n", "\n", "41. Create your own small dictionary with data and convert to JSON string.\n", "\n", "# XML\n", "\n", "## What is XML and why are we learning about it\n", "\n", "42. Unlike HTML, which allowed us to mark up and display information, XML is used for descriptive standards. \n", "\n", "43. For information professionals that work in places like libraries, XML is commonly associated with metadata, the descriptive information needed to describe information. That is, XML is used to encode metadata. \n", "\n", "44. 
Another example of digital work built on XML falls under the umbrella of the [Text Encoding Initiative](https://tei-c.org/), a group that’s been around since the 1980s and the early days of digital humanities. TEI includes a standardized set of tags that are used for marking or encoding various parts of a text.\n", "\n", "45. Sample projects that use TEI:\n", "- [Digital Archives and Pacific Culture](http://pacific.pitt.edu/InjuredIsland.html)\n", "- [The Digital Temple](https://digitaltemple.rotunda.upress.virginia.edu/)\n", "- [The Diary of Mary Martin](https://dh.tcd.ie/martindiary/)\n", "- [Toyota City Imaging Project](http://www.bodley.ox.ac.uk/toyota/)\n", "- [African American Women Writers](http://digital.nypl.org/schomburg/writers_aa19/)\n", "- [The Walt Whitman Archive](https://whitmanarchive.org/)\n", "\n", "46. XML is designed to store and transport data. It does not DO anything; XML is simply information that is wrapped in a set of tags. \n", "\n", "47. These tags can be user defined or from a standardized schema (like TEI). \n", "\n", "48. So, users of XML are free to develop their own set of tags or content standards in XML to describe whatever kind of information they would like.\n", "\n", "49. The root element is the `parent` element to all of the other elements within an XML document. \n", "\n", "50. The elements are arranged hierarchically: `parent` elements have `child` elements, and `child` elements have `sibling` or `subchild` elements. \n", "\n", "51. Indentation is used to indicate the hierarchical structure of an XML document. \n", "\n", "52. NOTE: XML tags are case sensitive. This matters when we are working in Python to isolate specific components or elements in XML data.\n", "\n", "53. General XML structure:\n", "\n", "```XML\n", "\n", "\n", " \n", " some text\n", " \n", " \n", " some more text\n", " \n", "\n", "```\n", "\n", "
XML specification from W3C: http://www.w3.org/TR/REC-xml/
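
The parent/child hierarchy and the case-sensitivity note above can be illustrated with a brief sketch. It uses the standard library's ElementTree module, which this lab introduces in more detail later. The element names (`root`, `child`, `Child`, `subchild`) are hypothetical placeholders chosen for this illustration, not names taken from the lab's data files.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: two children whose tags differ only by case
xml_string = '''
<root>
    <child>
        <subchild>some text</subchild>
    </child>
    <Child>
        <subchild>some more text</subchild>
    </Child>
</root>
'''.strip()

# Parse the string; the returned element is the root of the hierarchy
root = ET.fromstring(xml_string)

# Iterating over an element yields its direct children in document order
child_tags = [element.tag for element in root]
print(child_tags)

# Tag names are case sensitive, so each search matches only one element
lower_matches = root.findall('child')
upper_matches = root.findall('Child')
print(len(lower_matches), len(upper_matches))

# Walk one level further down to reach the text of a subchild
print(lower_matches[0].find('subchild').text)
```

Searching for `child` does not return the `Child` element, which is why matching the exact capitalization of a tag matters when isolating elements in Python.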
\n", "\n", "### XML Versus HTML\n", "\n", "54. According to W3C.....\n", "\n", "\n", "XML and HTML were designed with different goals:\n", "\n", "\n", "\n", "Explanation from: http://www.w3schools.com/xml/xml_whatis.asp\n", "\n", "### XML Example 1\n", "\n", "55. We can create an XML document describing the information in this table.\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| Course Title | Course Schedule | On-Campus Meeting Location | Instructor | Office Location | Email | Office Hours |
| --- | --- | --- | --- | --- | --- | --- |
| Elements of Computing II | T/R 5:30-6:45 PM | DeBartolo 102 | Dr. Katherine (Katie) Walden | 1046 Flanner | kwalden@nd.edu | T/R, 1-2 PM, 4-5 PM |
\n", " \n", "```XML\n", "\n", "\tComputer Science and Engineering\n", "\t10102\n", "\tElements of Computing II\n", "\tT/R\n", "\t5:30 PM\n", "\t6:45 PM\n", "\t102\n", "\tDeBartolo\n", "\tKatherine Walden\n", "\t1046\n", "\tFlanner\n", "\tkwalden@nd.edu\n", "\t1-2 PM, 3-4 PM\n", "\n", "```\n", " \n", "56. Each piece of information is enclosed in a set of tags just like HTML, and each set has an opening and closing tag. These are called elements. \n", "\n", "57. Unlike HTML, XML is only used to describe the data. It doesn’t provide instructions to the browser like HTML does in terms of formatting and display.\n", "\n", "58. XML elements can be further defined with attributes. \n", "\n", "59. In the first example, I used the instructor element to describe my role in the course. In this second example, I’ve added the type “assistant teaching professor” to further describe the instructor tag.\n", " \n", "```XML\n", "\n", "\tComputer Science and Engineering\n", "\t10102\n", "\tElements of Computing II\n", "\tT/R\n", "\t5:30 PM\n", "\t6:45 PM\n", "\t102\n", "\tDeBartolo\n", "\tKatherine Walden\n", " Katherine Walden\n", "\t1046\n", "\tFlanner\n", "\tkwalden@nd.edu\n", "\t1-2 PM, 3-4 PM\n", "\n", "```\n", "\n", "60. So, why would we want to mark up all of this information in XML? \n", "\n", "61. Well, imagine that we have a list of all of the different courses taught at Notre Dame. \n", "\n", "62. If we had all of this information marked up in XML, we could run queries against the data. \n", "\n", "63. For example, we could search for all of the courses taught by Jerod Weinman, or all of the courses taught by assistant professors, or find all of the courses taught on Mondays, etc. \n", "\n", "64. This is the power of encoding data in XML.\n", "\n", "### XML Example 2\n", "\n", "65. Let’s look at a more extensive example to illustrate the basic XML syntax. 
\n", "\n", "```XML\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "]>\n", "\n", "\n", " \n", " HTML and XHTML: The Definitive Guide\n", " \n", " Bill\n", " Kennedy\n", " \n", " \n", " Chuck\n", " Musciano\n", " \n", " 2006\n", " \n", " \n", " CSS: The Definitive Guide\n", " \n", " Eric\n", " Meyer\n", " \n", " 2007\n", " \n", " \n", " Learning XML\n", " \n", " Erik\n", " Ray\n", " \n", " 2003\n", " \n", "\n", "```\n", "\n", "66. Here I’ve created a file describing books related to XML, HTML, and CSS." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Q2: Describe the structure of this XML document in your own words.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "67. The root element of this document is ``, followed by a series of `` child elements that contain information about each of the individual books described in the document. \n", " \n", "68. Each `` has an attribute `@category` that describes the subject of the book, a `` (with an attribute `@language`), `<author>` (with child elements `<firstName>` and `<lastName>`), and a publication `<year>`.\n", "\n", "<p align=\"center\"><a href=\"https://github.com/kwaldenphd/json-xml-python/blob/main/Figure_1.jpg?raw=true\"><img class=\"aligncenter\" src=\"https://github.com/kwaldenphd/json-xml-python/blob/main/Figure_1.jpg?raw=true\" /></a></p>\n", "\n", "69. We can represent the structure generically in a graph, demonstrating the hierarchical structure of the XML document.\n", "\n", "## Reading XML into Python\n", "\n", "70. You might already be thinking about how we could interact with XML data in Python. \n", "\n", "71. XML elements have similarities to JSON's name-value pairs, and Python dictionary key-value pairs.\n", "\n", "72. We could represent the second XML example in Python using distinct dictionaries.\n", "\n", "74. One way to accomplish this goal with Python would be to generate two different dictionaries." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "book_0={'title': 'CSS: The Definitive Guide', 'author': 'Eric Meyer', 'date': '2007'}\n", "book_1={'title': 'Learning XML', 'author': 'Erik Ray', 'date': '2003'}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "75. We could then use a list to generate a list of the metadata for each of the works." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "book_0={'title': 'CSS: The Definitive Guide', 'author': 'Eric Meyer', 'date': '2007'}\n", "book_1={'title': 'Learning XML', 'author': 'Erik Ray', 'date': '2003'}\n", "\n", "books = [book_0, book_1]\n", "print(books)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "76. While this program outputs all of the metadata, one of the advantages of our XML file is that all of the information was stored in a single place.\n", "\n", "77. Another solution would be to embed lists in a dictionary." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "books = {\n", " 'title': ['CSS: The Definitive Guide', 'Learning XML'],\n", " 'date': ['2007', '2003'],\n", " 'author': ['Eric Meyer', 'Erik Ray']\n", " }\n", "\n", "# join the author names so they print as a phrase rather than a list\n", "print(\"My books include books by\", \" and \".join(books['author']) + \":\")\n", "for title in books['title']:\n", " print(\"\\t\" + title)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "78. This program outputs:\n", "\n", "```\n", "My books include books by Eric Meyer and Erik Ray:\n", " CSS: The Definitive Guide\n", " Learning XML\n", "```\n", "\n", "<blockquote>Note: The <code>\\t</code> is a shortcut for a TAB so that my list was indented. <code>\\n</code> will generate a new line in your output.</blockquote>\n", "\n", "79. While this dictionary contains all of the data from my XML, there are some limitations. The dates are not specifically connected to the titles that they are associated with. We can solve this problem by nesting a dictionary in a dictionary." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "books = {\n", " 'CSS: The Definitive Guide': {\n", " 'date': '2007', \n", " 'author': 'Eric Meyer'\n", " },\n", " 'Learning XML': {\n", " 'date': '2003',\n", " 'author': 'Erik Ray'\n", " }\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "80. This example includes a dictionary called `books` that holds two other dictionaries. It uses the title as the key for each of the dictionaries for the works. The value of each of these keys is a dictionary containing “date” and “author.”\n", "\n", "81. The following program will output the titles of items in this collection as well as associated dates." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"My Books: \")\n", "\n", "for book, book_info in books.items():\n", " full_title = str(book) + \" (\" + str(book_info['date']) + \")\"\n", " print(\"\\t\" + full_title)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "82. This program prints the heading `My Books:` first as a separate line. The `for` loop digs into the data for each of the books. \n", "\n", "83. `book` is the variable assigned to each of the keys in the `books` dictionary - in this example it is the titles of each of the books. \n", "\n", "84. `book_info` is the variable assigned to each of the values. The `.items()` method works through each key-value pair in the dictionary. \n", "\n", "85. The next line assigns a variable `full_title` that concatenates each `book` key with an opening `(`, the date for each work called with `book_info['date']`, and a closing `)`. \n", "\n", "86. The final line prints a tab `\\t` before each concatenated string that we just created with the `full_title` variable.\n", "\n", "87. 
This outputs:\n", "```\n", "My Books:\n", " CSS: The Definitive Guide (2007)\n", " Learning XML (2003)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<blockquote>Q3: Write a similar dictionary for the <code>books.xml</code> file contained in this repo and generate some output. Include code + comments. Explain how your program works in your own words.</blockquote>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parsing XML in Python\n", "\n", "88. You may be thinking, “…but we already have an XML file with this data in it. Can we use Python to work with .xml?” The answer to this question is YES! \n", "\n", "89. In this next task, we’ll use the [ElementTree API](https://docs.python.org/3/library/xml.etree.elementtree.html). This is part of the Python standard library and has been written specifically to parse XML. \n", "\n", "90. We’ll only work with a few functions and methods from ElementTree as an introduction to the tool.\n", "\n", "```XML\n", "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n", "<books>\n", " <book>\n", " <title>CSS: The Definitive Guide\n", " Eric Meyer\n", " 2007\n", " \n", " \n", " Learning XML\n", " Erik Ray\n", " 2003\n", " \n", "\n", "```\n", "\n", "91. First, we need to import the ElementTree module into our file so that we can use the methods and functions associated with ElementTree. Type this first line of code into a new file called `xml_lab.py` (or whatever you’d like to call it, as long as it is not `xml.py`, which would shadow the standard library's `xml` package)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cElementTree was removed in Python 3.9; ElementTree is the supported module\n", "import xml.etree.ElementTree as ET" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "92. Now we need to import our .xml file. Remember, you’ll need to include the entire file path for your .xml file if it is not in the same folder as your Python file." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import xml.etree.ElementTree as ET\n", "tree = ET.parse('books.xml')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "93. And, finally, we need to get the root element of our XML file. Remember in our last project we also had to name the root element in our XSL file. This is the tag at the top of the hierarchy. In the example file, the root is `<books>`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import xml.etree.ElementTree as ET\n", "tree = ET.parse('books.xml')\n", "root = tree.getroot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Working with XML in Python\n", "\n", "94. Now we are ready to work with our `.xml` file in Python. Try adding `print` commands." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import xml.etree.ElementTree as ET\n", "tree = ET.parse('books.xml')\n", "root = tree.getroot()\n", "\n", "getIterator = root.iter()\n", "\n", "for element in getIterator:\n", " print(element.tag, element.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "99. This program outputs:\n", "```\n", "books\n", "\n", "book\n", "\n", "title CSS: The Definitive Guide\n", "date 2007\n", "author Eric Meyer\n", "book\n", "\n", "title Learning XML\n", "date 2003\n", "author Erik Ray\n", "```\n", "\n", "100. Now let’s get some data from the file. This next program creates a loop that returns all the titles in the file. \n", "\n", "101. In the sample XML file, the `<title>` tag is a child of `<book>`, and `<book>` is a child of `<books>`. \n", "\n", "102. This loop says: for each book, assign the text from the `title` element to the variable `title`, and then print the title." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import xml.etree.ElementTree as ET\n", "tree = ET.parse('books.xml')\n", "root = tree.getroot()\n", "\n", "for book in root.findall('book'):\n", " title = book.find('title').text\n", " print(title)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "103. This program outputs:\n", "```\n", "CSS: The Definitive Guide\n", "Learning XML\n", "```\n", "104. But what if we want to pull the title and date? We can modify the code to also pull the information from the `<year>` tag." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import xml.etree.ElementTree as ET\n", "tree = ET.parse('books.xml')\n", "root = tree.getroot()\n", "\n", "for book in root.findall('book'):\n", " title = book.find('title').text\n", " date = book.find('year').text\n", " \n", " print(title)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<blockquote>Q4: What do you expect this program to output? Why? Explain how this code works in your own words.</blockquote>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<blockquote>Q5: Write a similar program for the .xml file that you created in the last exercise. Pull data from at least two elements. Copy your code and your output in your notebook and explain what your code does (or is attempting to do).</blockquote>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Writing to XML from Python\n", "\n", "105. When working with JSON, `json.dumps()` made it relatively straightforward to transform a Python dictionary back into a JSON string.\n", "\n", "106. Going from a Python dictionary to XML is less straightforward. In fact, you might use the third-party [`dicttoxml` module](https://pypi.org/project/dicttoxml/), which is specifically designed to help with that workflow.\n", "\n", "107. For now, we're going to demonstrate how you would create the XML structure manually in Python using ElementTree and then write that to an XML file.\n", "\n", "108. You create the root element using `ET.Element()`. \n", "\n", "109. Then you can create sub-elements (nested within the root element) using `ET.SubElement()`. \n", "\n", "110. The `SubElement()` function lets us specify the parent element, the tag, and any attributes: `SubElement(parent, tag, attrib={}, **extra)`\n", "\n", "111. In this signature, `parent` is the parent node for the sub-element. `attrib` is a dictionary with any element attributes. 
`extra` holds any additional keyword arguments passed to the `SubElement()` function.\n", "\n", "112. Then we would use our standard file `open()` and `write()` operations to write the newly created structure to an XML file.\n", "\n", "113. To put this all together:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# import components of Element Tree package\n", "import xml.etree.ElementTree as ET\n", "\n", "# create root element\n", "root = ET.Element('root')\n", "\n", "# create sub element nested within the root\n", "items = ET.SubElement(root, 'items')\n", "\n", "# create child elements for items\n", "item1 = ET.SubElement(items, 'item')\n", "item2 = ET.SubElement(items, 'item')\n", "item3 = ET.SubElement(items, 'item')\n", "\n", "# set a name attribute on each child element\n", "item1.set('name', 'item1')\n", "item2.set('name', 'item2')\n", "item3.set('name', 'item3')\n", "\n", "# set text for child item elements (item3 is left without text)\n", "item1.text = 'item1abc'\n", "item2.text = 'item2abc'\n", "\n", "# serialize the XML tree; tostring() returns bytes by default\n", "myData = ET.tostring(root)\n", "\n", "# create new XML file, opened in binary mode because myData is bytes\n", "myFile = open('items.xml', 'wb')\n", "\n", "# write serialized XML to file\n", "myFile.write(myData)\n", "\n", "# close file\n", "myFile.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<blockquote>Q6: Create a small XML structure and write it to an XML file.</blockquote>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## XML Project Prompt\n", "\n", "114. Navigate to an open data portal and download an XML file. \n", "\n", "115.
A few places to start:\n", "- [Data.gov](https://www.data.gov/)\n", "- [City of Chicago Data Portal](https://data.cityofchicago.org/)\n", "- [City of South Bend Open Data](https://data-southbend.opendata.arcgis.com/)\n", "- [Library of Congress XML finding aids](https://findingaids.loc.gov/source/main)\n", "- [National Library of Scotland](https://data.nls.uk/)\n", "- [National Archives and Records Administration (NARA)](https://www.archives.gov/developer#toc--datasets)\n", "- [Natural History Museum Data Portal](https://data.nhm.ac.uk/)\n", "- Various TEI projects\n", " * [Digital Archives and Pacific Culture](http://pacific.pitt.edu/InjuredIsland.html)\n", " * [The Digital Temple](https://digitaltemple.rotunda.upress.virginia.edu/)\n", " * [The Diary of Mary Martin](https://dh.tcd.ie/martindiary/) \n", " * [Toyota City Imaging Project](http://www.bodley.ox.ac.uk/toyota/)\n", " * [African American Women Writers](http://digital.nypl.org/schomburg/writers_aa19/)\n", " * [The Walt Whitman Archive](https://whitmanarchive.org/)\n", " * [Michigan State University, Feeding America: The Historic American Cookbook Dataset](https://lib.msu.edu/feedingamericadata/)\n", "\n", "116. Open the data in a text editor. Describe what you are seeing. How can we start to make sense of this data? What documentation is available?\n", "\n", "117. Read the XML data into Python and convert it to a Python value.\n", "\n", "# Lab Notebook Questions\n", "\n", "Q1: Decipher what we're seeing in the JSON here. What are the name/value pairs, and how are they organized in this object?\n", "\n", "Q2: Describe the structure of this XML document in your own words.\n", "\n", "Q3: Write a similar dictionary for the <code>books.xml</code> file contained in this repo and generate some output. Include code + comments. Explain how your program works in your own words.\n", "\n", "Q4: What do you expect this program to output? Why?
Explain how this code works in your own words.\n", "\n", "Q5: Write a similar program for the .xml file that you created in the last exercise. Pull data from at least two elements. Copy your code and your output in your notebook and explain what your code does (or is attempting to do).\n", "\n", "Q6: Create a small XML structure and write it to an XML file." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.1" } }, "nbformat": 4, "nbformat_minor": 4 }