{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "Created by [Nathan Kelber](http://nkelber.com) and Ted Lawless for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)
\n", "For questions/comments/improvements, email nathan.kelber@ithaka.org.
\n", "___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Working with Dataset Files**\n", "\n", "**Description:** This notebook describes how to:\n", "* Read and write files (.txt, .csv, .json)\n", "* Use the `tdm_client` to read in metadata\n", "* Use the `tdm_client` to read in data\n", "\n", "This notebook describes how to read and write text, CSV, and JSON files using Python. Additionally, it explains how the `tdm_client` can help users easily load and analyze their datasets.\n", "\n", "**Use Case:** For Learners (Detailed explanation, not ideal for researchers)\n", "\n", "**Difficulty:** Intermediate\n", "\n", "**Completion Time:** 60 minutes\n", "\n", "**Knowledge Required:** \n", "* Python Basics ([Start Python Basics I](./python-basics-1.ipynb))\n", "\n", "**Knowledge Recommended:** None\n", "\n", "**Data Format:** Text (.txt), CSV (.csv), JSON (.json)\n", "\n", "**Libraries Used:**\n", "* `pandas` to read and write CSV files\n", "* `json` to read and write JSON files\n", "* `tdm_client` to retrieve and read data\n", "\n", "**Research Pipeline:** None\n", "___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Files in Python\n", "Working with files is an essential part of Python programming. When we execute code in Python, we manipulate data through the use of variables. When the program is closed, however, any data stored in those variables is erased. To save the information stored in variables, we must learn how to write it to a file.\n", "\n", "At the same time, we may have notebooks for applying specific analyses, but we need to have a way to bring data into the notebook for analysis. Otherwise, we would have to type all the data into the program ourselves! Both reading-in from files and writing-out data out to files are important skills for data science and the digital humanities.\n", "\n", "This section describes how to work with three kinds of common data files in Python:\n", "* Plain Text Files (.txt)\n", "* Comma-Separated Value files (.csv)\n", "* Javascript Object Notation files (.json)\n", "\n", "Each of these filetypes are in wide use in data science, digital humanities, and general programming. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Three Common Data File Types" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plain Text Files (.txt)\n", "A plain text file is one of the simplest kinds of computer files. Plain text files can be opened with a text editor like Notepad (Windows 10) or TextEdit (OS X). The file can contain only basic textual characters such as: letters, numbers, spaces, and line breaks. Plain text files do not contain styling such as: heading sizes, italic, bold, or specialized fonts. (To including styling in a text file, writers may use other file formats such as rich text format (.rtf) or markdown (.md).)\n", "\n", "Plain text files (.txt) can be easily viewed and modified by humans by changing the text within. This is an important distinction from binary files such as images (.jpg), archives (.gzip), audio (.wav), or video (.mp4). If a binary file is opened with a text editor, the content will be largely unreadable." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comma-Separated Value Files (.csv)\n", "A comma-separated value file is also a text file that can easily be modifed with a text editor. A CSV file is generally used to store data that fits in a series or table (like a list or spreadsheet). A spreadsheet application (like Microsoft Excel or Google Sheets) will allow you to view and edit a CSV data in the form of a table.\n", "\n", "Each row of a CSV file represents a single row of a table. The values in a CSV are separated by commas (with no space between), but other delimiters can be chosen such as a tab or pipe (|). A tab-separated value file is called a TSV file (.tsv). Using tabs or pipes may be preferable if the data being stored contains commas (since this could make it confusing whether a comma is part of a single entry or a delimiter between entries).\n", "\n", "### The text contents of a sample CSV file\n", "```\n", "Username,Login email,Identifier,First name,Last name\n", "booker12,rachel@example.com,9012,Rachel,Booker\n", "grey07,,2070,Laura,Grey\n", "johnson81,,4081,Craig,Johnson\n", "jenkins46,mary@example.com,9346,Mary,Jenkins\n", "smith79,jamie@example.com,5079,Jamie,Smith\n", "```\n", "### The same CSV file represented in Google Sheets:\n", "\n", "![CSV table view in Google Sheets](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/csv_in_sheets.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## JavaScript Object Notation (.json)\n", "A Javascript Object Notation file is also a text file that can be modified with a text editor. A JSON file stores data in key/value pairs, very similar to a Python dictionary. One of the key benefits of JSON is its compactness which makes it ideal for exchanging data between web browsers and servers.\n", "\n", "While smaller JSON files can be opened in a text editor, larger files can be difficult to read. Viewing and editing JSON is easier in specialized editors, available online at sites like: \n", "\n", "* [JSON Formatter](http://jsonformatter.org)\n", "* [JSON Editor Online](https://jsoneditoronline.org/)\n", "\n", "A JSON file has a nested structure, where smaller concepts are grouped under larger ones. Like extensible markup language (.xml), a JSON file can be checked to determine that it is valid (follows the proper format for JSON) and/or well-formed (follows the proper format defined in a specialized example, called a schema). \n", "\n", "### The text contents of a sample JSON file\n", "\n", "```\n", "{\n", " \"firstName\": \"Julia\",\n", " \"lastName\": \"Smith\",\n", " \"gender\": \"woman\",\n", " \"age\": 57,\n", " \"address\": {\n", " \"streetAddress\": \"11434\",\n", " \"city\": \"Detroit\",\n", " \"state\": \"Mi\",\n", " \"postalCode\": \"48202\"\n", " },\n", " \"phoneNumbers\": [\n", " { \"type\": \"home\", \"number\": \"7383627627\" }\n", " ]\n", "}\n", "```\n", "### The same JSON file represented in JSON Editor Online\n", "![An image of the JSON file showing the structure](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/json_editor.png)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Opening, Reading, and Writing Text Files (.txt)\n", "\n", "## Open the File\n", "\n", "Before we can read or write to text file, we must open the file. Normally, when we open a file, a new window appears where we can see the contents. In Python, opening a file simply means create a *file object* that refers to the particular file. When the file has been opened, we can read or write to the file. Finally, we must close the file. Here are the three steps:\n", "\n", "1. Use the open() function to create a file object\n", "2. Use the .read(), .readlines(), or .write() method on the file object\n", "3. Use the close() function to close the file object\n", "\n", "Let's practice on `sample.txt`, a sample text file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Open the text file `sample.txt` creating\n", "# a file object called `f`\n", "\n", "f = open('data/sample.txt', 'r')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have created a file object called `f`. The first argument (`'sample.txt'`) is a string containing the file name. You can see the sample.txt in the same directory as this lesson. If your file was called reports.txt, you would replace that argument with `'reports.txt'`. The second argument (`'r'`) determines that we are opening the file in \"read\" mode, where the file can be read but not modified. There are three main modes that can be specified:\n", "\n", "|Argument|Mode Name|Description|\n", "|---|---|---|\n", "|'r'|read|Reads file but no writing allowed (protects file from modification)|\n", "|'w'|write|Writes to file, overwriting any information in the file (saves over the current version of the file|\n", "|'a'|append|Appends (or adds) information to the end of the file (new information is added below old information)|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Read the File\n", "\n", "### .read() method\n", "Now that we have a file object `f` opened in \"read mode,\" let's read the contents with the `.read()` method. We will create a variable called `file_contents` to hold the data that we are reading in." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a variable called `file_contents`\n", "# that will hold the result of using the\n", "# .read() method on our file object\n", "file_contents = f.read()\n", "print(file_contents)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we are finished with the file object, we must close it using the .close() method on it. It is very important to always close a file, otherwise your program may crash or create memory problems." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Close the file by using the .close() method\n", "# on the file object\n", "f.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### .readlines() method\n", "\n", "If a file is very large, we may want to read a single line at a time so as not to fill all of the available computer memory. To read a single line at a time, we can use the `.readlines()` method instead of the `.read()` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Open the file sample.txt in read mode\n", "# creating the file object `f`\n", "f = open('data/sample.txt', 'r')\n", "file_contents = f.readlines()\n", "print(file_contents)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the `.read()` method, we read in the whole text file as a single string. The `.readlines()` gives us a Python list, where each item is a single string representing a single line of the text document. You may also notice that each line ends with `\\n` which represents a line break in a string. If we print the string, the line break is visible in our output." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Print the first item in the file_contents list\n", "# Note the \\n turns into a line break\n", "print(file_contents[0])\n", "f.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Write to the File\n", "To write to a file, we need to open it and create our file object in either write ('w') or append ('a') mode. \n", "\n", "### Append mode\n", "Let's start with append mode which adds new data to the bottom of the file while leaving any previous information intact." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Opening a file in append mode\n", "# and creating a new file object \n", "# called `f`\n", "f = open('data/sample.txt', 'a')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can use the `.write()` method to append a string to the file. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Appending an eleventh line to the file\n", "f.write('\\nThis is the eleventh line')\n", "f.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can you read the file back in to see whether the `.write()` was successful?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Open the the file in read mode\n", "# create a file object called `sample_file`\n", "f = \n", "\n", "# Use the .read() method on the file object\n", "# Store the result in a variable `file_contents`\n", "file_contents = \n", "\n", "# Print the contents\n", "print(file_contents)\n", "\n", "# Close the file\n", "f.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Write mode\n", "Opening a file in write mode is useful in two scenarios:\n", "* Creating a new text file and writing data to it\n", "* Overwriting all data in the file with new data\n", "\n", "Here is an example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Creating a new file in write mode\n", "f = open('data/new_sample.txt', 'w')\n", "\n", "# Define a string variable to add to the new file\n", "string = 'Here is some data\\nWith a second line'\n", "\n", "# Using write method on the file object\n", "contents = f.write(string)\n", "\n", "# Close file object\n", "f.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Open/Close Files `with open`\n", "The `with open` technique is commonly used in Python because it has two significant advantages:\n", "* It is more compact \n", "* It automatically closes the file afterward\n", "\n", "The basic form resembles a flow control statement, ending in a colon and then executing an indented block of code. After the block executes, the file is closed automatically." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open('data/sample.txt', 'r') as f:\n", " print(f.read())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Opening, Reading, and Writing CSV Files (.csv)\n", "CSV file data can be easily opened, read, and written using the `pandas` library. (For large CSV files (>500 mb), you may wish to use the `csv` library to read in a single row at a time to reduce the memory footprint.) Pandas is flexible for working with tabular data, and the process for importing and exporting to CSV is simple." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import pandas \n", "import pandas as pd\n", "\n", "# Create our dataframe\n", "df = pd.read_csv('data/sample.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Display the dataframe\n", "print(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After you've made any necessary changes in Pandas, write the dataframe back to the CSV file. (Remember to always back up your data before writing over the file.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write data to new file\n", "# Keeping the Header but removing the index\n", "df.to_csv('data/new_sample.csv', header=True, index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Opening, Reading, and Writing JSON Files (.json)\n", "\n", "JSON files use a key/value structure very similar to a Python dictionary. We will start with a Python dictionary called `py_dict` and then write the data to a JSON file using the `json` library." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Defining sample data in a Python dictionary\n", "py_dict = {\n", " \"firstName\": \"Julia\",\n", " \"lastName\": \"Smith\",\n", " \"gender\": \"woman\",\n", " \"age\": 57,\n", " \"address\": {\n", " \"streetAddress\": \"11434\",\n", " \"city\": \"Detroit\",\n", " \"state\": \"Mi\",\n", " \"postalCode\": \"48202\"\n", " },\n", " \"phoneNumbers\": [\n", " { \"type\": \"home\", \"number\": \"7383627627\" }\n", " ]\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To write our dictionary to a JSON file, we will use the `with open` technique we learned that automatically closes file objects. We also need the `json` library to help dump our dictionary data into the file object. The `json.dump` function works a little differently than the write method we saw with text files. \n", "\n", "We need to specify two arguments: \n", "\n", "* The data to be dumped\n", "* The file object where we are dumping" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Open/create sample.json in write mode\n", "# as the file object `f`. The data in py_dict\n", "# is dumped into `f` and then `f` is closed\n", "\n", "import json\n", "with open('data/sample.json', 'w') as f:\n", " json.dump(py_dict, f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To read data in from a JSON file, we can use the `json.load` function on our file object. Here we load all the content into a variable called `content`. We can then print values based on particular keys." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open('data/sample.json') as f:\n", " contents = json.load(f)\n", " print('First Name: ' + contents['firstName'])\n", " print('Last Name: '+ contents['lastName'])\n", " print('Age: ' + str(contents['age']))\n", " print('Phone Number: ', contents['phoneNumbers'][0]['number'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Opening datasets with `tdm_client`\n", "\n", "The `tdm_client` helps retrieve a given dataset and/or its associated metadata. The metadata is supplied in a CSV file and the full dataset is supplied in a compressed JSON Lines file (.jsonl.gz). For any analysis focused on metadata pre-processing, we recommend users start with the CSV file since it is both easier and faster to view, parse, and manipulate.\n", "\n", "\n", "## Metadata CSV vs. JSON Lines Data File\n", "All of the textual data and metadata is available inside of the JSON Lines files, but we have chosen to offer the metadata CSV for two primary reasons:\n", "\n", "1. The JSON Lines data is a little more complex to parse since it is nested. It cannot be easily represented in a table form in something like Pandas. It is nice to be able to easily view all the metadata in a Pandas dataframe.\n", "2. The JSON Lines data can be very large. Each file contains all of the metadata *plus* unigram counts, bigram counts, trigram counts, and full-text (when available). Manipulating all that data takes significant computer time and costs. Even a modest dataset (~5000 files) can be over 1 GB in size uncompressed.\n", "\n", "More information is available, including the metadata categories, in the FAQ [\"What is the data file format?\"](https://docs.tdm-pilot.org/what-format-are-jstor-portico-datasets/). \n", "\n", "## Retrieving data (`tdm_client` methods)\n", "\n", "By passing the `tdm_client` a dataset ID (here called `dataset_id`), we can automatically download the metadata CSV file or the full JSON Lines dataset created by the [dataset builder](https://tdm-pilot.org/builder/).\n", "\n", "* Use the `.get_metadata()` method to retrieve the metadata CSV file\n", "* Use the `get_dataset()` method to retrieve the full JSON data file\n", "\n", "|Code|Result|\n", "|---|---|\n", "|f = tdm_client.get_metadata(dataset_id)|Automatically retrieves a metadata CSV file and creates a file object `f`|\n", "|f = tdm_client.get_dataset(dataset_id)| Automatically retrieves a compressed JSON Lines dataset file (jsonl.gz) and creates a file object `f`|\n", "\n", "The JSON Lines file will be downloaded in a compressed gzip format (jsonl.gz). We can iterate over each document in the corpus by using the `dataset_reader()` method.\n", "\n", "## Reading in data from a dataset object\n", "\n", "The `dataset_reader()` method will read in data from the compressed JSON dataset file object. By keeping the data in a compressed format and reading in a single line at a time, we reduce the processing time and memory use. These can be substantial for large datasets. Even a modest dataset (~5000 files) can be over 1 GB in size uncompressed.\n", "\n", "The `dataset_reader()` essentially unzips each row of the dataset at a time. Each row constitutes all the metadata and data available for a single document. Here is what that looks like the actual tdm_client:\n", "\n", "```with gzip.open(file_path, \"rb\") as input_file:\n", " for row in input_file:\n", " yield json.loads(row)\n", "```\n", "\n", "In most cases, users will want to iterate over every document/row in the JSON Lines file. In practice, that looks like:\n", "\n", "|Code|Result|\n", "|---|---|\n", "|for document in tdm_client.dataset_reader(f):| Iterates over each document in file object `f`|\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": null }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "312px" }, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 4 }