{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "fjEICkK_X_8i" }, "source": [ "# Reading CSV Files" ] }, { "cell_type": "markdown", "metadata": { "id": "f04C6PbbX_9B" }, "source": [ "Comma-separated Values (CSV) are the most common text-based file format for sharing geospatial data. The structure of the file is 1 data record per line, with individual *columns* separated by a comma. \n", "\n", "In general, the separator character is called a delimiter. Other popular delimiters include the tab (\\\\t), colon (:) and semi-colon (;) characters. \n", "\n", "Reading CSV file properly requires us to know which delimiter is being used, along with quote character to surround the field values that contain space of the delimiter character. Since reading delimited text file is a very common operation, and can be tricky to handle all the corner cases, Python comes with its own library called `csv` for easy reading and writing of CSV files. To use it, you just have to import it.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "oPoaAWrGX_9G" }, "outputs": [], "source": [ "import csv" ] }, { "cell_type": "markdown", "metadata": { "id": "kr-wOGufX_9J" }, "source": [ "The preferred way to read CSV files is using the `DictReader()` method. Which directly reads each row and creates a dictionary from it - with column names as *key* and column values as *value*. Let's see how to read a file using the `csv.DictReader()` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "U5y1klXEX_9K" }, "outputs": [], "source": [ "import os\n", "data_pkg_path = 'data'\n", "filename = 'worldcities.csv'\n", "path = os.path.join(data_pkg_path, filename)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "PnGJ9LovX_9L" }, "outputs": [], "source": [ "f = open(path, 'r')\n", "csv_reader = csv.DictReader(f, delimiter=',', quotechar='\"')\n", "print(csv_reader)\n", "f.close()" ] }, { "cell_type": "markdown", "metadata": { "id": "p43TmAcXX_9N" }, "source": [ "## Using `enumerate()` function\n", "\n", "When iterating over an object, many times we need a counter. We saw in the previous example, how to use a variable like `count` and increase it with every iteration. There is an easy way to do this using the built-in `enumerate()` function." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "CYePkICvX_9O" }, "outputs": [], "source": [ "cities = ['San Francisco', 'Los Angeles', 'New York', 'Atlanta']\n", "for x in enumerate(cities):\n", " print(x)" ] }, { "cell_type": "markdown", "metadata": { "id": "2SQQcOnkX_9P" }, "source": [ "We can use enumerate() on any iterable object and get a tuple with an index and the iterable value with each iteration. Let's use it to print the first 5 lines from the DictReader object." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JMdaJQtfX_9Q" }, "outputs": [], "source": [ "f = open(path, 'r', encoding='utf-8')\n", "csv_reader = csv.DictReader(f, delimiter=',', quotechar='\"')\n", "for index, row in enumerate(csv_reader):\n", " print(row)\n", " if index == 4:\n", " break\n", "f.close()" ] }, { "cell_type": "markdown", "metadata": { "id": "VyY7n8nYX_9R" }, "source": [ "## Using `with` statement\n" ] }, { "cell_type": "markdown", "metadata": { "id": "8ZIN3M19X_9S" }, "source": [ "The code for file handling requires we open a file, do something with the file object and then close the file. That is tedious and it is possible that you may forget to call `close()` on the file. If the code for processing encounters an error the file is not closed property, it may result in bugs - especially when writing files.\n", "\n", "The preferred way to work with file objects is using the `with` statement. It results in simpler and cleaer code - which also ensures file objects are closed properly in case of errors.\n", "\n", "As you see below, we open the file and use the file object `f` in a `with` statement. Python takes care of closing the file when the execution of code within the statement is complete." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "vu9-IO38X_9T" }, "outputs": [], "source": [ "with open(path, 'r', encoding='utf-8') as f:\n", " csv_reader = csv.DictReader(f)" ] }, { "cell_type": "markdown", "metadata": { "id": "DMydfSKFX_9U" }, "source": [ "## Filtering rows\n", "\n", "We can use conditional statement while iterating over the rows, to select and process rows that meet certain criterial. Let's count how many cities from a particular country are present in the file.\n", "\n", "Replace the `home_country` variable with your home country below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "rPti4RzWX_9V" }, "outputs": [], "source": [ "home_country = 'India'\n", "num_cities = 0\n", "\n", "with open(path, 'r', encoding='utf-8') as f:\n", " csv_reader = csv.DictReader(f)\n", "\n", " for row in csv_reader:\n", " if row['country'] == home_country:\n", " num_cities += 1\n", " \n", "print(num_cities)" ] }, { "cell_type": "markdown", "metadata": { "id": "s0HpZM5tX_9W" }, "source": [ "## Calculating distance\n", "\n", "Let's apply the skills we have learnt so far to solve a complete problem. We want to read the `worldcities.csv` file, find all cities within a home country, calculate the distance to each cities from a home city and write the results to a new CSV file." ] }, { "cell_type": "markdown", "metadata": { "id": "PtQ1YIb1X_9W" }, "source": [ "First we find the coordinates of the out selected `home_city` from the file. Replace the `home_city` below with your hometown or a large city within your country. Note that we are using the `city_ascii` field for city name comparison, so make sure the `home_city` variable contains the ASCII version of the city name." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "RxAciTfdX_9W" }, "outputs": [], "source": [ "home_city = 'Bengaluru'\n", "\n", "home_city_coordinates = ()\n", "\n", "with open(path, 'r', encoding='utf-8') as f:\n", " csv_reader = csv.DictReader(f)\n", " for row in csv_reader:\n", " if row['city_ascii'] == home_city:\n", " lat = row['lat']\n", " lng = row['lng']\n", " home_city_coordinates = (lat, lng)\n", " break\n", " \n", "print(home_city_coordinates)" ] }, { "cell_type": "markdown", "metadata": { "id": "LKm3cno5X_9W" }, "source": [ "Now we can loop through the file, find a city in the chosen home country and call the `geopy.distance.geodesic()` function to calculate the distance. In the code below, we are just computing first 5 matches." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1pirEpD0X_9X" }, "outputs": [], "source": [ "from geopy import distance\n", "\n", "counter = 0\n", "with open(path, 'r', encoding='utf-8') as f:\n", " csv_reader = csv.DictReader(f)\n", " for row in csv_reader:\n", " if (row['country'] == home_country and\n", " row['city_ascii'] != home_city):\n", " city_coordinates = (row['lat'], row['lng'])\n", " city_distance = distance.geodesic(\n", " city_coordinates, home_city_coordinates).km\n", " print(row['city_ascii'], city_distance)\n", " counter += 1\n", " \n", " if counter == 5:\n", " break\n", " " ] }, { "cell_type": "markdown", "metadata": { "id": "fsgJ_oZxX_9X" }, "source": [ "## Writing files\n", "\n", "Instead of printing the results, let's write the results to a new file. Similar to csv.DictReader(), there is a companion `csv.DictWriter()` method to write files. We create a `csv_writer` object and then write rows to it using the `writerow()` method." ] }, { "cell_type": "markdown", "metadata": { "id": "SKz12_N0X_9X" }, "source": [ "First we create an `output` folder to save the results. We can first check if the folder exists and if it doesn't exist, we can create it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bT5z7rmOX_9X" }, "outputs": [], "source": [ "output_dir = 'output'\n", "if not os.path.exists(output_dir):\n", " os.mkdir(output_dir)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NAn7zE7_X_9Y" }, "outputs": [], "source": [ "output_filename = 'cities_distance.csv'\n", "output_path = os.path.join(output_dir, output_filename)\n", "\n", "with open(output_path, mode='w', newline='', encoding='utf-8') as output_file:\n", " fieldnames = ['city', 'distance_from_home']\n", " csv_writer = csv.DictWriter(output_file, fieldnames=fieldnames)\n", " csv_writer.writeheader()\n", " \n", " # Now we read the input file, calculate distance and\n", " # write a row to the output \n", " with open(path, 'r', encoding='utf-8') as f:\n", " csv_reader = csv.DictReader(f)\n", " for row in csv_reader:\n", " if (row['country'] == home_country and\n", " row['city_ascii'] != home_city):\n", " city_coordinates = (row['lat'], row['lng'])\n", " city_distance = distance.geodesic(\n", " city_coordinates, home_city_coordinates).km\n", " csv_writer.writerow(\n", " {'city': row['city_ascii'],\n", " 'distance_from_home': city_distance}\n", " )" ] }, { "cell_type": "markdown", "metadata": { "id": "_fmYgE0FX_9Y" }, "source": [ "Below is the complete code for our task of reading a file, filtering it, calculating distance and writing the results to a file." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "kK_ChJjAX_9Y" }, "outputs": [], "source": [ "import csv\n", "import os\n", "from geopy import distance\n", "\n", "data_pkg_path = 'data'\n", "input_filename = 'worldcities.csv'\n", "input_path = os.path.join(data_pkg_path, input_filename)\n", "output_filename = 'cities_distance.csv'\n", "output_dir = 'output'\n", "output_path = os.path.join(output_dir, output_filename)\n", "\n", "if not os.path.exists(output_dir):\n", " os.mkdir(output_dir)\n", " \n", "home_city = 'Bengaluru'\n", "home_country = 'India'\n", "\n", "with open(input_path, 'r', encoding='utf-8') as input_file:\n", " csv_reader = csv.DictReader(input_file)\n", " for row in csv_reader:\n", " if row['city_ascii'] == home_city:\n", " home_city_coordinates = (row['lat'], row['lng'])\n", " break\n", "\n", "with open(output_path, mode='w', newline='') as output_file:\n", " fieldnames = ['city', 'distance_from_home']\n", " csv_writer = csv.DictWriter(output_file, fieldnames=fieldnames)\n", " csv_writer.writeheader()\n", "\n", " with open(input_path, 'r', encoding='utf-8') as input_file:\n", " csv_reader = csv.DictReader(input_file)\n", " for row in csv_reader:\n", " if (row['country'] == home_country and\n", " row['city_ascii'] != home_city):\n", " city_coordinates = (row['lat'], row['lng'])\n", " city_distance = distance.geodesic(\n", " city_coordinates, home_city_coordinates).km\n", " csv_writer.writerow(\n", " {'city': row['city_ascii'],\n", " 'distance_from_home': city_distance}\n", " )\n", "\n", "print('Successfully written output file at {}'.format(output_path))" ] }, { "cell_type": "markdown", "metadata": { "id": "E8tFLpCnX_9Z" }, "source": [ "## Exercise\n", "\n", "Replace the `home_city` and `home_country` variables with your own home city and home country and create a CSV file containing distance from your home city to every other city in your country." ] }, { "cell_type": "markdown", "metadata": { "id": "aHHSUxyDX_9Z" }, "source": [ "----" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 0 }