{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### x lines of Python\n", "\n", "# Read and write CSV files\n", "\n", "This notebook accompanies the [agilescientific.com](https://agilescientific.com/) blog post of the same name:\n", "\n", "> [**x lines of Python: read and write CSV**](https://agilescientific.com/blog/2017/8/23/x-lines-of-python-read-and-write-csv)\n", "\n", "## Introduction\n", "\n", "CSV files are the de facto standard way to store data on the web. They are human-readable, easy to parse with multiple tools, and they compress easily. So you need to know how to read and write them in Python.\n", "\n", "Nine times out of ten, the way to read and write CSV files with Python is with [`pandas`](http://pandas.pydata.org/). We'll do that first. But that's not always an option (maybe you don't want the dependency in your code), so we'll do it lots of ways:\n", "\n", "- With `pandas`\n", "- 2 ways with the built-in package `csv`...\n", " - `csv.reader`\n", " - `csv.DictReader` \n", "- With `NumPy`.\n", "\n", "We'll also use `pandas` to read a couple of CSV files from the web (as opposed to from flat files on our computer)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using `pandas`\n", "\n", "Using [`pd.read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "fname = \"../data/periods.csv\"" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df = pd.read_csv(fname)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameabbreviationstartend
0CambrianC541.00485.40
1OrdovicianO485.40443.80
2SilurianS443.80419.20
3DevonianD419.20358.90
4MississipianMiss358.90323.20
5PennsylvanianPenn323.20298.90
6PermianP298.90252.20
7TriassicTr252.20201.30
8JurassicJ201.30145.00
9CretaceousK154.0066.00
10PalaeogenePg66.0020.03
11NeogeneNg23.032.58
12QuaternaryQ2.580.00
\n", "
" ], "text/plain": [ " name abbreviation start end\n", "0 Cambrian C 541.00 485.40\n", "1 Ordovician O 485.40 443.80\n", "2 Silurian S 443.80 419.20\n", "3 Devonian D 419.20 358.90\n", "4 Mississipian Miss 358.90 323.20\n", "5 Pennsylvanian Penn 323.20 298.90\n", "6 Permian P 298.90 252.20\n", "7 Triassic Tr 252.20 201.30\n", "8 Jurassic J 201.30 145.00\n", "9 Cretaceous K 154.00 66.00\n", "10 Palaeogene Pg 66.00 20.03\n", "11 Neogene Ng 23.03 2.58\n", "12 Quaternary Q 2.58 0.00" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can get the start of the Permian like this:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "6 298.9\n", "Name: start, dtype: float64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.name==\"Permian\"].start" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's fix the start of the Cretaceous:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "9 145.0\n", "Name: start, dtype: float64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[df.name=='Cretaceous', 'start'] = 145.0\n", "df.loc[df.name=='Cretaceous', 'start']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After you have changed or added to a DataFrame, `pandas` also makes it very easy to write a CSV file containing your data." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df.to_csv(\"../data/pdout.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using `csv.reader`\n", "\n", "[Docs for the ordinary reader.](https://docs.python.org/3.5/library/csv.html#csv.reader)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import csv" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "with open(fname) as f:\n", " reader = csv.reader(f)\n", " data = [row for row in reader]" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[['name', 'abbreviation', 'start', 'end'],\n", " ['Cambrian', 'C', '541', '485.4'],\n", " ['Ordovician', 'O', '485.4', '443.8'],\n", " ['Silurian', 'S', '443.8', '419.2'],\n", " ['Devonian', 'D', '419.2', '358.9'],\n", " ['Mississipian', 'Miss', '358.9', '323.2'],\n", " ['Pennsylvanian', 'Penn', '323.2', '298.9'],\n", " ['Permian', 'P', '298.9', '252.2'],\n", " ['Triassic', 'Tr', '252.2', '201.3'],\n", " ['Jurassic', 'J', '201.3', '145'],\n", " ['Cretaceous', 'K', '154', '66'],\n", " ['Palaeogene', 'Pg', '66', '20.03'],\n", " ['Neogene', 'Ng', '23.03', '2.58'],\n", " ['Quaternary', 'Q', '2.58', '0.0']]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['298.9']" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[d[2] for d in data if d[0]==\"Permian\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we needed to know the positions of the items in the rows, which we could only get by inspection. We could skip that header row if we wanted to, but there's a better way: use the header as the keys in a dictionary..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using `csv.DictReader`\n", "\n", "[Docs for the DictReader.](https://docs.python.org/3.5/library/csv.html#csv.DictReader)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "with open(fname) as f:\n", " reader = csv.DictReader(f)\n", " data = [row for row in reader]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[{'abbreviation': 'C', 'end': '485.4', 'name': 'Cambrian', 'start': '541'},\n", " {'abbreviation': 'O', 'end': '443.8', 'name': 'Ordovician', 'start': '485.4'},\n", " {'abbreviation': 'S', 'end': '419.2', 'name': 'Silurian', 'start': '443.8'},\n", " {'abbreviation': 'D', 'end': '358.9', 'name': 'Devonian', 'start': '419.2'},\n", " {'abbreviation': 'Miss',\n", " 'end': '323.2',\n", " 'name': 'Mississipian',\n", " 'start': '358.9'},\n", " {'abbreviation': 'Penn',\n", " 'end': '298.9',\n", " 'name': 'Pennsylvanian',\n", " 'start': '323.2'},\n", " {'abbreviation': 'P', 'end': '252.2', 'name': 'Permian', 'start': '298.9'},\n", " {'abbreviation': 'Tr', 'end': '201.3', 'name': 'Triassic', 'start': '252.2'},\n", " {'abbreviation': 'J', 'end': '145', 'name': 'Jurassic', 'start': '201.3'},\n", " {'abbreviation': 'K', 'end': '66', 'name': 'Cretaceous', 'start': '154'},\n", " {'abbreviation': 'Pg', 'end': '20.03', 'name': 'Palaeogene', 'start': '66'},\n", " {'abbreviation': 'Ng', 'end': '2.58', 'name': 'Neogene', 'start': '23.03'},\n", " {'abbreviation': 'Q', 'end': '0.0', 'name': 'Quaternary', 'start': '2.58'}]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['298.9']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[d['start'] for d in data if d['name']==\"Permian\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a corresponding [`DictWriter`](https://docs.python.org/3.5/library/csv.html#csv.DictWriter) class for writing CSVs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that `pandas` has lots of file readers, including ones for:\n", "\n", "- Excel files\n", "- JSON\n", "- SAS\n", "- Stata\n", "\n", "It can even read the clipboard!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bonus: reading a CSV file from the web\n", "\n", "For example, [this one](https://github.com/seg/2016-ml-contest/blob/master/training_data.csv) is hosted by GitHub. It's publicly readable, so we don't need to authenticate.\n", "\n", "In the [X Lines of Python: Machine Learning](04_Machine_learning.ipynb) notebook, I read the online file into a buffer, but it turns out you don't need to do this — you can just give `pd.read_csv()` a URL!" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameabbreviationstartend
0CambrianC541.0485.4
1OrdovicianO485.4443.8
2SilurianS443.8419.2
3DevonianD419.2358.9
4MississipianMiss358.9323.2
\n", "
" ], "text/plain": [ " name abbreviation start end\n", "0 Cambrian C 541.0 485.4\n", "1 Ordovician O 485.4 443.8\n", "2 Silurian S 443.8 419.2\n", "3 Devonian D 419.2 358.9\n", "4 Mississipian Miss 358.9 323.2" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import requests\n", "import io\n", "\n", "df = pd.read_csv('https://raw.githubusercontent.com/agile-geoscience/xlines/master/data/periods.csv')\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bonus: using NumPy\n", "\n", "`pandas` is perfect for this CSV because it's really a table, containing a mixture of data types (strings and floats).\n", "\n", "Nonetheless, we can read it as an array... I'm not really into 'named arrays', so I'll just read the two numeric columns.\n", "\n", "We'll use [`np.genfromtxt`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html)." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 541. , 485.4 ],\n", " [ 485.4 , 443.8 ],\n", " [ 443.8 , 419.2 ],\n", " [ 419.2 , 358.9 ],\n", " [ 358.9 , 323.2 ],\n", " [ 323.2 , 298.9 ],\n", " [ 298.9 , 252.2 ],\n", " [ 252.2 , 201.3 ],\n", " [ 201.3 , 145. ],\n", " [ 154. , 66. ],\n", " [ 66. , 20.03],\n", " [ 23.03, 2.58],\n", " [ 2.58, 0. ]])" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = np.genfromtxt(fname, delimiter=',', skip_header=1, usecols=[2,3])\n", "x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can write a CSV like so:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "np.savetxt(\"../data/npout.csv\", x, delimiter=\",\", header=\"start,end\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bonus: reading a CSV file from Google Docs\n", "\n", "It used to be easy to anonymously read a public file directly from Google Docs, but now you need an API key. It's not too hard to set up, but you'll need to [read some docs](https://developers.google.com/sheets/api/).\n", "\n", "When you have an API key, put it here..." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true }, "outputs": [], "source": [ "key = \"PUT YOUR KEY HERE\"" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import json" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameabbreviationstartend
0CambrianC541485.4
1OrdovicianO485.4443.8
2SilurianS443.8419.2
3DevonianD419.2358.9
4MississipianMiss358.9323.2
\n", "
" ], "text/plain": [ " name abbreviation start end\n", "0 Cambrian C 541 485.4\n", "1 Ordovician O 485.4 443.8\n", "2 Silurian S 443.8 419.2\n", "3 Devonian D 419.2 358.9\n", "4 Mississipian Miss 358.9 323.2" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url = \"https://sheets.googleapis.com/v4/spreadsheets/{id}/values/{sheet}\"\n", "\n", "meta = {\"id\": \"1YlnEGT8uHpRllk7rjAgFFl8V6B5-kl02DBie11PjG9Q\",\n", " \"sheet\": \"Sheet1\"\n", " }\n", "url = url.format(**meta)\n", "\n", "params = {\"key\": key}\n", "\n", "r = requests.get(url, params=params)\n", "j = json.loads(r.text)['values']\n", "df = pd.DataFrame(j[1:], columns=j[0])\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }