{ "cells": [ { "cell_type": "markdown", "source": [ "# Chapter 16: Data formats I (CSV and TSV)\n", "\n", "In the subsequent chapters, we will introduce you to several data formats. In the previous chapters, you have already worked with plain text. Now, we are going to work with: \n", "\n", "* tables (CSV/TSV) \n", "* JSON\n", "* XML \n", "\n", "These formats are simply the result of agreements that were made between people on how to organize and store data. Some formats, such as XML and RDF, have a high degree of structure, whereas plain text is a typical example of unstructured data. Structuring data according to predefined specifications allows information in the data to be easily ordered and processed by machines. You can compare highly structured data with a perfectly organized filing cabinet where everything is identified, labeled, and easy to access.\n", "\n", "**In general, you can treat any of these data formats as a text file with certain regularities people have agreed on**. You can simply read in the file and use the tools you already to exploit these regularities and access information. For some data formats with rather complex structures (such as XML), it is more convenient to use existing python packages to extract information. \n", "\n", "This notebook introduces tabular formats: CSV/TSV. TSV/CSV is a highly transparent way of structuring data in tables. It is a rather straightforward structure and does not necessarily require specific Python packages. \n", "\n", "**At the end of this chapter, you will be able to:**\n", "* read CSV/TSV data\n", "* manipulate CSV/TSV data\n", "* write CSV/TSV data\n", "\n", "**If you want to learn more about these topics, you might find the following links useful:**\n", "* [Tutorial: Reading and Manipulating CSV Files](https://www.protechtraining.com/blog/post/python-for-beginners-reading-manipulating-csv-files-737)\n", "* If you want more Raymond Hettinger content and learn about his opinion on what makes good python programers, [this video](https://www.youtube.com/watch?v=UANN2Eu6ZnM) is for you.\n", "\n", "\n", "If you have **questions** about this chapter, please contact us at cltl.python.course@gmail.com." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## 1. Introduction to CSV and TSV (tables)\n", "The **table** is probably one of the most common and intuitive data formats. Undoubtedly, you have already worked with tabular data in Excel, Numbers, or Google Sheets. A table represents a set of data points as a series of rows, with a column for each of the data points' properties. In other words: a table has vertical **columns** (often identifiable by name) and horizontal **rows**, and the **cells** are the unit where a row and column intersect. This is a very simple example:\n", "\n", "| name | house | patronus | \n", "|----------|------------|----------| \n", "| Harry | Gryffindor | stag | \n", "| Hermione | Gryffindor | otter | \n", "| Draco | Slytherin | unknown | \n", "| Hannah | Hufflepuff | unknown | \n", "\n", "Tabular data can be encoded as **CSV (comma-separated values) or TSV (tab-separated values)**. CSV and TSV files are simply plain text files in which each line represents a row, and the columns are separated by a comma (for CSV) or a tab character (for TSV).\n", "\n", "\n", "\n" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "You can find examples of CSV files in the folder `../Data/csv_data`. For instance, look at the files in `../Data/csv_data/baby_names/names_by_state/`. If you like, open them in a text editor (e.g., [Atom](https://atom.io/), [BBEdit](https://www.barebones.com/products/bbedit/download.html), or [Notepad++](https://notepad-plus-plus.org)), or Excel (convert text to columns by using the comma as delimiter) to see their content. \n", "\n", "For example, these are the first 10 rows of the CSV file `AK.csv`:\n", "\n", "\n", "![box](./images/csv.png) \n", "\n", "A TSV file would look like this (note that the tab separator is represented as '\\t', which is not shown here):\n", "\n", "\n", "![box](./images/tsv.png) \n", "\n", "\n", "The file `AK.csv` contains a list of names given to children in the state of Alaska from 1910 to 2015 with their frequency. Each line in this file has five elements, which are separated by commas: \n", "\n", "- the state abbreviation (AK for Alaska)\n", "- gender (F/M)\n", "- year\n", "- name\n", "- frequency of that name in the given year and state \n", "\n", "Below, we will work towards representing this data in Python as a **list of lists** (i.e., a nested list) or as a **list of dicts**. In both cases, the elements of the (first) list represent the complete rows. The individual rows, then, can be either represented as a list (without column names) or as a dictionary (with column names).\n", "\n", "| LIST OF LISTS | LIST OF DICTS |\n", "|:-------------------------:|:-------------------------:|\n", "![box](./images/list_of_lists.png) | ![box](./images/list_of_dicts.png)" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## 2. Reading CSV files\n", "\n", "Because CSV/TSV files are essentially text files, we can open and read them in the same way as we have seen before:" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# Read the file just as we read a text file\n", "filename = \"../Data/csv_data/baby_names/names_by_state/AK.csv\"\n", "with open(filename, \"r\") as csvfile:\n", " content = csvfile.read()" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# we've now stored the content as a string:\n", "print(type(content))" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# since the table is pretty long, we're slicing the string and only printing the first 200 characters\n", "print(content[:200])" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "Please also have a close look at the internal representation of the file. Do you see how the columns are separated by commas, and the rows by newline characters `\\n`?" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "print(repr(content[:200]))" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "### 2.1 Reading rows as lists\n", "Now, let's see how we can get to the 'list of lists' representation. We can do that by iterating over each line of this file (as we have seen before), and then split each row into columns using the `split()` method:" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# Read the file and get all lines\n", "\n", "# assign new line character to a variable\n", "# (we are going to need this when processing the lines of the file)\n", "#new line on linux/unix\n", "newline_char = '\\n'\n", "#new line on windows (double-check this in an editor or use repr())\n", "#newline_char = '\\r\\n'\n", "\n", "# create empty list to collect the rows:\n", "csv_data = []\n", "# store path to file in variable\n", "filename = \"../Data/csv_data/baby_names/names_by_state/AK.csv\"\n", "# open file and read in line-by line\n", "with open(filename, \"r\") as csvfile:\n", " for row in csvfile:\n", " # remove new-line charater - \n", " row = row.strip(newline_char) \n", " # separate the cells on comma (csv file)\n", " cells = row.split(\",\") # split the line into cells\n", " csv_data.append(cells) \n", "\n", "# Print only first 10 rows\n", "print(csv_data[:10])\n", "print()\n", "\n", "# Iterate over first 10 rows - this is more readable\n", "for row in csv_data[:10]:\n", " print(row)" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "It worked! The variable `csv_data` now contains a list of all rows in the file. Now we can easily work with the data by using the indices of the lists to access cells. Have a look at the following examples:" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# Example: print all information of the first 10 rows\n", "for row in csv_data[:10]:\n", " state = row[0]\n", " gender = row[1]\n", " year = row[2]\n", " name = row[3]\n", " frequency = row[4]\n", " print(state, gender, year, name, frequency)" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# Example: print all names given in 1912\n", "for row in csv_data:\n", " year = row[2]\n", " name = row[3]\n", " # note that the year is represented as a string (not an int)\n", " if year == \"1912\":\n", " print(name)" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "### 2.2 Reading rows as dicts\n", "\n", "We can also create a 'list of dicts'. We do this by first creating a dictionary for each row, and appending that dictionary to the list of rows. We can simply use the column headers as keys, so we can easily access the corresponding values later on. \n", "\n", "Note: Do you see that we also convert the numerical values to `int`?" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# Read the file and get all lines\n", "\n", "# again, create a list to collect the rows:\n", "csv_data = []\n", "\n", "filename = \"../Data/csv_data/baby_names/names_by_state/AK.csv\"\n", "with open(filename, \"r\") as csvfile:\n", " for row in csvfile:\n", " row = row.strip(\"\\n\") # remove all newlines\n", " columns = row.split(\",\") # split the line into columns\n", " \n", " # Create a dictionary and add to list\n", " dict_row = {\"state\": columns[0],\n", " \"gender\": columns[1],\n", " \"year\": int(columns[2]),\n", " \"name\": columns[3],\n", " \"frequency\": int(columns[4])}\n", " csv_data.append(dict_row)\n", "\n", "# First 10 rows\n", "print(csv_data[:10])\n", "print()\n", "\n", "# Iterate over first 10 rows \n", "for row in csv_data[:10]:\n", " print(row)" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "Again, we can now easily work with the data, but now we use the *names* of the columns instead of indices to access the cells. This can make the code more readable. Have a look at the following examples:" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# Example: print all information of the first 5 rows\n", "for row in csv_data[:5]:\n", " for column_name, cell_value in row.items():\n", " print(column_name, \"=\", cell_value)\n", " print()" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# Example: print all names given in 1912\n", "for row in csv_data:\n", " if row[\"year\"] == 1912:\n", " print(row[\"name\"])" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "It does not really matter whether you choose for a 'list of lists' or a 'list of dicts'. Just use the one that you prefer." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## 3. Writing CSV files\n", "\n", "Let's say now we have a table in Python stored as a 'list of lists' or as a 'list of dicts', and we want to store our result in a CSV file. This is basically the inverse process of reading a CSV file. " ], "metadata": {} }, { "cell_type": "markdown", "source": [ "### 3.1 Writing rows as lists\n", "\n", "In order to write a list of lists as a CSV file, we need to iterate over the rows and make a string out of them. Remember that we can concatenate strings in a list with any separator with the `join()` method:" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "a_list = [\"John\", \"john@example.nl\", \"555-1234\"]\n", "a_string = \",\".join(a_list)\n", "print(a_string)" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "In addition, we should use the newline character `\\n` to write each row on a line." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "Run the cell below and check whether it worked by opening the file `../Data/address_book.csv` in a text editor. Is everything correct? Did you use the correct new-line character?" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# assign new line character to a variable\n", "# (we are going to need this when processing the lines of the file)\n", "#new line on linux/unix\n", "newline_char = '\\n'\n", "#new line on windows (double-check this in an editor or use repr())\n", "#newline_char = '\\r\\n'\n", "\n", "# Create list of lists\n", "address_book = [\n", " [\"John\", \"john@example.nl\", \"555-1234\"],\n", " [\"William\", \"william@example.nl\", \"555-5678\"],\n", " [\"Jane\", \"jane@example.nl\", \"555-7777\"]\n", "]\n", "\n", "# Write the list of lists to a CSV file\n", "outfilename = \"../Data/address_book.csv\"\n", "with open(outfilename, \"w\") as outfile:\n", " for row in address_book:\n", " line = f'{\",\".join(row)}{newline_char}'\n", " outfile.write(line)" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "### 3.2 Writing rows as dicts\n", "\n", "In order to write a list of dicts to a CSV file, we need to first get all the values in each dictionary. The rest works exactly the same. In the following code, we use the tab separator `\\t` and save it with the `.tsv` extension." ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# Create list of dicts\n", "address_book = [\n", " {\"name\":\"John\", \"e-mail\":\"john@example.nl\", \"phone\":\"555-1234\"},\n", " {\"name\":\"William\", \"e-mail\":\"william@example.nl\", \"phone\":\"555-5678\"},\n", " {\"name\":\"Jane\", \"e-mail\":\"jane@example.nl\", \"phone\":\"555-7777\"}\n", "]\n", "\n", "\n", "# Write the list of dicts to a TSV file\n", "outfilename = \"../Data/csv_data/address_book.tsv\"\n", "with open(outfilename, \"w\") as outfile: \n", " \n", " # Write the rows using the values of the dictionaries\n", " for row in address_book:\n", " # get the values in each cell:\n", " # (assuming the order is always the same)\n", " cell_values = row.values()\n", " # join them using a tab (for tsv)\n", " line = \"\\t\".join(cell_values) + '\\n'\n", " outfile.write(line)" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "## 4. Dealing with column headers\n", "\n", "Many csv/tsv files contain headers. This means that the first row contains the names of the columns. We easily read and write these kinds of files without having to type the headers manually. \n", "\n", "\n", "### 4.1 Reading csv/tsv files with a header row\n", "Consider the file called Concreteness_ratings_Brysbaert_et_al_BRM.txt (we're reading it in and printing the first 5 lines below). \n", "\n", "The file has 10 columns, each of which have a column **header**. This means that all the values in the first row are not actual values. We need to treat them differently if we want to analyze the data (i.e. everything from the second row on). In addition, it would be convenient to use this information right away, without having to first inspect the file and manually type the headers as keys in dictionaries!" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "with open('../Data/csv_data/Concreteness_ratings_Brysbaert_et_al_BRM.tsv') as infile:\n", " lines = infile.read().split('\\n')\n", " \n", "for line in lines[:5]:\n", " print(line)" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "Can you complete the example below?" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "data_dict_list = []\n", "\n", "with open('../Data/csv_data/Concreteness_ratings_Brysbaert_et_al_BRM.tsv') as infile:\n", " lines = infile.read().split('\\n')\n", "\n", "# the first row is at index 0 and contains the headers\n", "headers = lines[0]\n", "\n", "# we consider everythin following the header row as data\n", "for line in lines[1:3]:\n", " line_list = line.split('\\t')\n", " header_list = headers.split('\\t')\n", " row_dict = dict()\n", " \n", " # can you think of a way to fill the row dict? \n", " # hint: check out the zip function\n", " # zip for iterating over two lists at the same time\n", " \n", " ### YOUR CODE ####\n", " \n", " data_dict_list.append(row_dict)" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "### 4.2 Writing csv/tsv files with a header row\n", "\n", "You can also write files with headers: \n" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "row1 = {'name': 'Harry', 'house' : 'Gryffindor'}\n", "row2 = {'name': 'Ron', 'house' : 'Gryffindor'}\n", "row3 = {'name': 'Hannah', 'house': 'Hufflepuff'}\n", "row4 = {'name': 'Leta', 'house' : 'Slytherin'}\n", "\n", "data_dict_list = [row1, row2, row2, row4]\n", "\n", "for d in data_dict_list:\n", " print(d)\n", " \n", "header_row = ### your code here\n", "\n", "with open('../Data/csv_data/hp_example.csv', 'w') as outfile:\n", " # write header\n", " for d in data_dict_list:\n", " row_list = d.values()\n", " row = ','.join(values)\n", " outfile.write(row+'\\n')" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "## A note on csv/tsv files containing a lot of textual data\n", "\n", "If you're dealing with tables containing textual data (e.g. full sentences or tweets), you will realize that splitting on commas or tabs may result in confusion (as text often contains commas and may contain tabs). Consider for instances this file: `../Data/csv_data/debate.csv`.\n", "\n", "You may have ideas for a fix regarding this specific file and surely you will eventually find a good solution for it. However, for these more complex CSV files, we recommend that you make use of the **[csv module](https://docs.python.org/3/library/csv.html)**. Feel free to explore it yourself, but do not use it for the exercises and assignments in this course." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Exercises" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "### Exercise 1:\n", "Have another look at the code below. Can you predict what would happen if you skip some of the steps? How will the data be different?" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "filename = \"../Data/csv_data/baby_names/names_by_state/AK.csv\"\n", "with open(filename, \"r\") as csvfile:\n", " csv_data = []\n", " for row in csvfile:\n", " #row = row.strip(\"\\n\") # what happens if you skip this step?\n", " columns = row.split(\",\") \n", " csv_data.append(columns) \n", "print(csv_data[0:2])" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "filename = \"../Data/csv_data/baby_names/names_by_state/AK.csv\"\n", "with open(filename, \"r\") as csvfile:\n", " csv_data = []\n", " for row in csvfile:\n", " row = row.strip(\"\\n\") \n", " #columns = row.split(\",\") # what happes if you skip this step?\n", " csv_data.append(row) # replaced columns by row\n", "print(csv_data[0:2])" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "### Exercise 2:\n", "Read the csv data in `AK.csv` and store it as a **list of lists**. Now print the following:\n", "- all names that started with an M given in 1990\n", "- all unique female names (hint: create a set)\n", "- all names that were given more than 30 times in a certain year (print name+year)\n", "- all unique names longer than 8 letters given between 1985 and 1990" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "Read the csv data in `AK.csv` and store it as a **list of dicts**. Now print the following:\n", "- all names that started with an M given in 1990\n", "- all unique female names (hint: create a set)\n", "- all names that were given more than 30 times in a certain year (print name+year)\n", "- all unique names longer than 8 letters given between 1985 and 1990" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "### Exercise 3:\n", "Can you think of a way to add a header to the TSV file below? Hint: make use of the dictionary keys." ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# Create list of dicts\n", "address_book = [\n", " {\"name\":\"John\", \"e-mail\":\"john@example.com\", \"phone\":\"555-1234\"},\n", " {\"name\":\"William\", \"e-mail\":\"william@example.com\", \"phone\":\"555-5678\"},\n", " {\"name\":\"Jane\", \"e-mail\":\"jane@example.com\", \"phone\":\"555-7777\"}\n", "]\n", "\n", "# Write the list of dicts to a TSV file\n", "outfilename = \"../Data/csv_data/address_book.tsv\"\n", "with open(outfilename, \"w\") as outfile: \n", " \n", " # Write the header\n", " # your code here\n", " \n", " # Write the rows using the values of the dictionaries\n", " for row in address_book:\n", " column_values = row.values()\n", " line = \"\\t\".join(column_values) + '\\n'\n", " outfile.write(line)" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "### Exercise 4: \n", "Now *append* information about Jennifer and Justin (stored as a list) at the bottom of this `address_book.tsv` file (attention: make sure you do not overwrite the contents of the file)." ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "more_people = [\n", " [\"Jennifer\", \"jennifer@example.com\", \"555-9876\"],\n", " [\"Justin\", \"justin@example.com\", \"555-5555\"]\n", "]\n", "\n", "# Append the data to the `address_book.tsv` file" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "### Exercise 5: \n", "Create a function `load_tabular_data(filename, delimiter)` that receives the filename and a delimiter as input parameters, and returns the file content as a list of lists. Then we can call this function for CSV (with a comma as an argument) and for TSV (with a tabulator as an argument). Also make sure that `delimiter` is a keyword parameter, with a defaul value ','." ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# Create your function here\n", "\n", " \n", "# Now let's test the function\n", "csv_filename = \"../Data/csv_data/baby_names/names_by_state/AK.csv\"\n" ], "outputs": [], "metadata": {} } ], "metadata": { "kernelspec": { "name": "python3", "display_name": "Python 3.8.10 64-bit ('nlp': conda)" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "interpreter": { "hash": "6106edc083458b68f61c14c570e0f5152b4e1e25a61780539c3fe413e38ae5e6" } }, "nbformat": 4, "nbformat_minor": 4 }