{
"metadata": {
"name": "",
"signature": "sha256:c2665a7eac4e4ee7016fc4decebfb7b5a4b35ad315c92a754c6f4320ba711c92"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Reading and Writing Data in Text Format"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Parsing functions in pandas:\n",
"\n",
"read_csv: Load delimited file from a file, URL, or file-like object. Use comma as default delimiter.\n",
"\n",
"read_table: Load delimited data from a file, URL, or file-like object. Use tab ('\\t') as a default delimiter.\n",
"\n",
"read_fwf: Read data in fixed-width column format (that is, no limiters)\n",
"\n",
"read_clipboard: Version of read_Table that reads data from the clipboard. Useful for converting tables from webpages."
]
},
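{
"cell_type": "markdown",
"metadata": {},
"source": [
"read_fwf and read_clipboard are not exercised below. As a minimal sketch of read_fwf (using made-up in-memory data and explicit column widths, not a real file):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from io import StringIO\n",
"\n",
"# Hypothetical fixed-width data; widths gives the character width of each column\n",
"data = 'id name  score\\n1  alice 10.5\\n2  bob   3.2\\n'\n",
"pd.read_fwf(StringIO(data), widths=[3, 6, 5])"
],
"language": "python",
"metadata": {},
"outputs": []
},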
{
"cell_type": "code",
"collapsed": false,
"input": [
"!cat /Users/sergulaydore/pydata-book/ch06/ex1.csv"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"a,b,c,d,message\r\n",
"1,2,3,4,hello\r\n",
"5,6,7,8,world\r\n",
"9,10,11,12,foo"
]
}
],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since this is comma-delimited, we can use read_csv to read it into a DataFrame:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df = pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex1.csv')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"
\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
" c | \n",
" d | \n",
" message | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
" hello | \n",
"
\n",
" \n",
" 1 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
" 8 | \n",
" world | \n",
"
\n",
" \n",
" 2 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
" 12 | \n",
" foo | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 4,
"text": [
" a b c d message\n",
"0 1 2 3 4 hello\n",
"1 5 6 7 8 world\n",
"2 9 10 11 12 foo"
]
}
],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We could also have used read_table and specifiying the delimiter:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"pd.read_table('/Users/sergulaydore/pydata-book/ch06/ex1.csv', sep = ',')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
" c | \n",
" d | \n",
" message | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
" hello | \n",
"
\n",
" \n",
" 1 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
" 8 | \n",
" world | \n",
"
\n",
" \n",
" 2 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
" 12 | \n",
" foo | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 5,
"text": [
" a b c d message\n",
"0 1 2 3 4 hello\n",
"1 5 6 7 8 world\n",
"2 9 10 11 12 foo"
]
}
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A file will not always a header file:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!cat /Users/sergulaydore/pydata-book/ch06/ex2.csv"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1,2,3,4,hello\r\n",
"5,6,7,8,world\r\n",
"9,10,11,12,foo"
]
}
],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"assign default column names"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex2.csv', header = None)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
" hello | \n",
"
\n",
" \n",
" 1 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
" 8 | \n",
" world | \n",
"
\n",
" \n",
" 2 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
" 12 | \n",
" foo | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 7,
"text": [
" 0 1 2 3 4\n",
"0 1 2 3 4 hello\n",
"1 5 6 7 8 world\n",
"2 9 10 11 12 foo"
]
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"or you can specify names yourself:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex2.csv', names = ['a','b', 'c', 'd', 'message'])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
" c | \n",
" d | \n",
" message | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
" hello | \n",
"
\n",
" \n",
" 1 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
" 8 | \n",
" world | \n",
"
\n",
" \n",
" 2 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
" 12 | \n",
" foo | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 8,
"text": [
" a b c d message\n",
"0 1 2 3 4 hello\n",
"1 5 6 7 8 world\n",
"2 9 10 11 12 foo"
]
}
],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Suppose you wanted the message column to be the index of the returned DataFrame:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"names = ['a', 'b', 'c', 'd', 'message']"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 9
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex2.csv', names = names, index_col = 'message')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
" c | \n",
" d | \n",
"
\n",
" \n",
" message | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" hello | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
"
\n",
" \n",
" world | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
" 8 | \n",
"
\n",
" \n",
" foo | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
" 12 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 10,
"text": [
" a b c d\n",
"message \n",
"hello 1 2 3 4\n",
"world 5 6 7 8\n",
"foo 9 10 11 12"
]
}
],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"to form hierarchical index from multiple columns"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!cat /Users/sergulaydore/pydata-book/ch06/csv_mindex.csv"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"key1,key2,value1,value2\r\n",
"one,a,1,2\r\n",
"one,b,3,4\r\n",
"one,c,5,6\r\n",
"one,d,7,8\r\n",
"two,a,9,10\r\n",
"two,b,11,12\r\n",
"two,c,13,14\r\n",
"two,d,15,16\r\n"
]
}
],
"prompt_number": 13
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"parsed = pd.read_csv('/Users/sergulaydore/pydata-book/ch06/csv_mindex.csv', index_col = ['key1','key2'])"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 14
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"parsed"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" | \n",
" value1 | \n",
" value2 | \n",
"
\n",
" \n",
" key1 | \n",
" key2 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" one | \n",
" a | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" b | \n",
" 3 | \n",
" 4 | \n",
"
\n",
" \n",
" c | \n",
" 5 | \n",
" 6 | \n",
"
\n",
" \n",
" d | \n",
" 7 | \n",
" 8 | \n",
"
\n",
" \n",
" two | \n",
" a | \n",
" 9 | \n",
" 10 | \n",
"
\n",
" \n",
" b | \n",
" 11 | \n",
" 12 | \n",
"
\n",
" \n",
" c | \n",
" 13 | \n",
" 14 | \n",
"
\n",
" \n",
" d | \n",
" 15 | \n",
" 16 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 15,
"text": [
" value1 value2\n",
"key1 key2 \n",
"one a 1 2\n",
" b 3 4\n",
" c 5 6\n",
" d 7 8\n",
"two a 9 10\n",
" b 11 12\n",
" c 13 14\n",
" d 15 16"
]
}
],
"prompt_number": 15
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In some cases, a table might not have a fixed delimiter"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"list(open('/Users/sergulaydore/pydata-book/ch06/ex3.txt')) "
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 17,
"text": [
"[' A B C\\n',\n",
" 'aaa -0.264438 -1.026059 -0.619500\\n',\n",
" 'bbb 0.927272 0.302904 -0.032399\\n',\n",
" 'ccc -0.264273 -0.386314 -0.217601\\n',\n",
" 'ddd -0.871858 -0.348382 1.100491\\n']"
]
}
],
"prompt_number": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case fields are separated by a variable amount of white space. This can be expressed by the regular expression \\s+."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"result = pd.read_table('/Users/sergulaydore/pydata-book/ch06/ex3.txt', sep = '\\s+')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 18
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"result"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" aaa | \n",
" -0.264438 | \n",
" -1.026059 | \n",
" -0.619500 | \n",
"
\n",
" \n",
" bbb | \n",
" 0.927272 | \n",
" 0.302904 | \n",
" -0.032399 | \n",
"
\n",
" \n",
" ccc | \n",
" -0.264273 | \n",
" -0.386314 | \n",
" -0.217601 | \n",
"
\n",
" \n",
" ddd | \n",
" -0.871858 | \n",
" -0.348382 | \n",
" 1.100491 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 19,
"text": [
" A B C\n",
"aaa -0.264438 -1.026059 -0.619500\n",
"bbb 0.927272 0.302904 -0.032399\n",
"ccc -0.264273 -0.386314 -0.217601\n",
"ddd -0.871858 -0.348382 1.100491"
]
}
],
"prompt_number": 19
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can skip the first, third and fourth rows of a file with skiprows:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!cat /Users/sergulaydore/pydata-book/ch06/ex4.csv"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"# hey!\r\n",
"a,b,c,d,message\r\n",
"# just wanted to make things more difficult for you\r\n",
"# who reads CSV files with computers, anyway?\r\n",
"1,2,3,4,hello\r\n",
"5,6,7,8,world\r\n",
"9,10,11,12,foo"
]
}
],
"prompt_number": 20
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex4.csv', skiprows = [0,2,3])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
" c | \n",
" d | \n",
" message | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
" hello | \n",
"
\n",
" \n",
" 1 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
" 8 | \n",
" world | \n",
"
\n",
" \n",
" 2 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
" 12 | \n",
" foo | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 21,
"text": [
" a b c d message\n",
"0 1 2 3 4 hello\n",
"1 5 6 7 8 world\n",
"2 9 10 11 12 foo"
]
}
],
"prompt_number": 21
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Missing data is usually either not present (empty string) or marked by some sentinel value. By default, pandas uses a set of commonly occuring sentinels, such as NA, -1.#IND, and NULL:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!cat /Users/sergulaydore/pydata-book/ch06/ex5.csv"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"something,a,b,c,d,message\r\n",
"one,1,2,3,4,NA\r\n",
"two,5,6,,8,world\r\n",
"three,9,10,11,12,foo"
]
}
],
"prompt_number": 23
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"result = pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex5.csv')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 24
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"result"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" something | \n",
" a | \n",
" b | \n",
" c | \n",
" d | \n",
" message | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" one | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" two | \n",
" 5 | \n",
" 6 | \n",
" NaN | \n",
" 8 | \n",
" world | \n",
"
\n",
" \n",
" 2 | \n",
" three | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
" 12 | \n",
" foo | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 25,
"text": [
" something a b c d message\n",
"0 one 1 2 3 4 NaN\n",
"1 two 5 6 NaN 8 world\n",
"2 three 9 10 11 12 foo"
]
}
],
"prompt_number": 25
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"pd.isnull(result)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" something | \n",
" a | \n",
" b | \n",
" c | \n",
" d | \n",
" message | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" True | \n",
"
\n",
" \n",
" 1 | \n",
" False | \n",
" False | \n",
" False | \n",
" True | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 2 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 26,
"text": [
" something a b c d message\n",
"0 False False False False False True\n",
"1 False False False True False False\n",
"2 False False False False False False"
]
}
],
"prompt_number": 26
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The na_values option can take either a list or set of strings to consider missing values:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"result = pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex5.csv', na_values = ['NULL'])"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 28
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"result"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" something | \n",
" a | \n",
" b | \n",
" c | \n",
" d | \n",
" message | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" one | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" two | \n",
" 5 | \n",
" 6 | \n",
" NaN | \n",
" 8 | \n",
" world | \n",
"
\n",
" \n",
" 2 | \n",
" three | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
" 12 | \n",
" foo | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 29,
"text": [
" something a b c d message\n",
"0 one 1 2 3 4 NaN\n",
"1 two 5 6 NaN 8 world\n",
"2 three 9 10 11 12 foo"
]
}
],
"prompt_number": 29
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Different NA sentinels can be specified for each column in a dict:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"sentinels = {'message':['foo', 'NA'], 'something': ['two']}"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 30
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex5.csv', na_values = sentinels)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" something | \n",
" a | \n",
" b | \n",
" c | \n",
" d | \n",
" message | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" one | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" NaN | \n",
" 5 | \n",
" 6 | \n",
" NaN | \n",
" 8 | \n",
" world | \n",
"
\n",
" \n",
" 2 | \n",
" three | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
" 12 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 31,
"text": [
" something a b c d message\n",
"0 one 1 2 3 4 NaN\n",
"1 NaN 5 6 NaN 8 world\n",
"2 three 9 10 11 12 NaN"
]
}
],
"prompt_number": 31
},
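{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want only your own sentinels to be treated as missing, read_csv also accepts keep_default_na=False, which disables the built-in sentinel list. A minimal sketch with made-up in-memory data:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from io import StringIO\n",
"\n",
"# With keep_default_na=False, the string 'NA' survives; only 'x' becomes NaN\n",
"pd.read_csv(StringIO('a,b\\nNA,1\\nx,2'), keep_default_na=False, na_values=['x'])"
],
"language": "python",
"metadata": {},
"outputs": []
},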
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Reading Text Files in Pieces"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"result = pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex6.csv')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 32
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"result"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
" four | \n",
" key | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.467976 | \n",
" -0.038649 | \n",
" -0.295344 | \n",
" -1.824726 | \n",
" L | \n",
"
\n",
" \n",
" 1 | \n",
" -0.358893 | \n",
" 1.404453 | \n",
" 0.704965 | \n",
" -0.200638 | \n",
" B | \n",
"
\n",
" \n",
" 2 | \n",
" -0.501840 | \n",
" 0.659254 | \n",
" -0.421691 | \n",
" -0.057688 | \n",
" G | \n",
"
\n",
" \n",
" 3 | \n",
" 0.204886 | \n",
" 1.074134 | \n",
" 1.388361 | \n",
" -0.982404 | \n",
" R | \n",
"
\n",
" \n",
" 4 | \n",
" 0.354628 | \n",
" -0.133116 | \n",
" 0.283763 | \n",
" -0.837063 | \n",
" Q | \n",
"
\n",
" \n",
" 5 | \n",
" 1.817480 | \n",
" 0.742273 | \n",
" 0.419395 | \n",
" -2.251035 | \n",
" Q | \n",
"
\n",
" \n",
" 6 | \n",
" -0.776764 | \n",
" 0.935518 | \n",
" -0.332872 | \n",
" -1.875641 | \n",
" U | \n",
"
\n",
" \n",
" 7 | \n",
" -0.913135 | \n",
" 1.530624 | \n",
" -0.572657 | \n",
" 0.477252 | \n",
" K | \n",
"
\n",
" \n",
" 8 | \n",
" 0.358480 | \n",
" -0.497572 | \n",
" -0.367016 | \n",
" 0.507702 | \n",
" S | \n",
"
\n",
" \n",
" 9 | \n",
" -1.740877 | \n",
" -1.160417 | \n",
" -1.637830 | \n",
" 2.172201 | \n",
" G | \n",
"
\n",
" \n",
" 10 | \n",
" 0.240564 | \n",
" -0.328249 | \n",
" 1.252155 | \n",
" 1.072796 | \n",
" 8 | \n",
"
\n",
" \n",
" 11 | \n",
" 0.764018 | \n",
" 1.165476 | \n",
" -0.639544 | \n",
" 1.495258 | \n",
" R | \n",
"
\n",
" \n",
" 12 | \n",
" 0.571035 | \n",
" -0.310537 | \n",
" 0.582437 | \n",
" -0.298765 | \n",
" 1 | \n",
"
\n",
" \n",
" 13 | \n",
" 2.317658 | \n",
" 0.430710 | \n",
" -1.334216 | \n",
" 0.199679 | \n",
" P | \n",
"
\n",
" \n",
" 14 | \n",
" 1.547771 | \n",
" -1.119753 | \n",
" -2.277634 | \n",
" 0.329586 | \n",
" J | \n",
"
\n",
" \n",
" 15 | \n",
" -1.310608 | \n",
" 0.401719 | \n",
" -1.000987 | \n",
" 1.156708 | \n",
" E | \n",
"
\n",
" \n",
" 16 | \n",
" -0.088496 | \n",
" 0.634712 | \n",
" 0.153324 | \n",
" 0.415335 | \n",
" B | \n",
"
\n",
" \n",
" 17 | \n",
" -0.018663 | \n",
" -0.247487 | \n",
" -1.446522 | \n",
" 0.750938 | \n",
" A | \n",
"
\n",
" \n",
" 18 | \n",
" -0.070127 | \n",
" -1.579097 | \n",
" 0.120892 | \n",
" 0.671432 | \n",
" F | \n",
"
\n",
" \n",
" 19 | \n",
" -0.194678 | \n",
" -0.492039 | \n",
" 2.359605 | \n",
" 0.319810 | \n",
" H | \n",
"
\n",
" \n",
" 20 | \n",
" -0.248618 | \n",
" 0.868707 | \n",
" -0.492226 | \n",
" -0.717959 | \n",
" W | \n",
"
\n",
" \n",
" 21 | \n",
" -1.091549 | \n",
" -0.867110 | \n",
" -0.647760 | \n",
" -0.832562 | \n",
" C | \n",
"
\n",
" \n",
" 22 | \n",
" 0.641404 | \n",
" -0.138822 | \n",
" -0.621963 | \n",
" -0.284839 | \n",
" C | \n",
"
\n",
" \n",
" 23 | \n",
" 1.216408 | \n",
" 0.992687 | \n",
" 0.165162 | \n",
" -0.069619 | \n",
" V | \n",
"
\n",
" \n",
" 24 | \n",
" -0.564474 | \n",
" 0.792832 | \n",
" 0.747053 | \n",
" 0.571675 | \n",
" I | \n",
"
\n",
" \n",
" 25 | \n",
" 1.759879 | \n",
" -0.515666 | \n",
" -0.230481 | \n",
" 1.362317 | \n",
" S | \n",
"
\n",
" \n",
" 26 | \n",
" 0.126266 | \n",
" 0.309281 | \n",
" 0.382820 | \n",
" -0.239199 | \n",
" L | \n",
"
\n",
" \n",
" 27 | \n",
" 1.334360 | \n",
" -0.100152 | \n",
" -0.840731 | \n",
" -0.643967 | \n",
" 6 | \n",
"
\n",
" \n",
" 28 | \n",
" -0.737620 | \n",
" 0.278087 | \n",
" -0.053235 | \n",
" -0.950972 | \n",
" J | \n",
"
\n",
" \n",
" 29 | \n",
" -1.148486 | \n",
" -0.986292 | \n",
" -0.144963 | \n",
" 0.124362 | \n",
" Y | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 9970 | \n",
" 0.633495 | \n",
" -0.186524 | \n",
" 0.927627 | \n",
" 0.143164 | \n",
" 4 | \n",
"
\n",
" \n",
" 9971 | \n",
" 0.308636 | \n",
" -0.112857 | \n",
" 0.762842 | \n",
" -1.072977 | \n",
" 1 | \n",
"
\n",
" \n",
" 9972 | \n",
" -1.627051 | \n",
" -0.978151 | \n",
" 0.154745 | \n",
" -1.229037 | \n",
" Z | \n",
"
\n",
" \n",
" 9973 | \n",
" 0.314847 | \n",
" 0.097989 | \n",
" 0.199608 | \n",
" 0.955193 | \n",
" P | \n",
"
\n",
" \n",
" 9974 | \n",
" 1.666907 | \n",
" 0.992005 | \n",
" 0.496128 | \n",
" -0.686391 | \n",
" S | \n",
"
\n",
" \n",
" 9975 | \n",
" 0.010603 | \n",
" 0.708540 | \n",
" -1.258711 | \n",
" 0.226541 | \n",
" K | \n",
"
\n",
" \n",
" 9976 | \n",
" 0.118693 | \n",
" -0.714455 | \n",
" -0.501342 | \n",
" -0.254764 | \n",
" K | \n",
"
\n",
" \n",
" 9977 | \n",
" 0.302616 | \n",
" -2.011527 | \n",
" -0.628085 | \n",
" 0.768827 | \n",
" H | \n",
"
\n",
" \n",
" 9978 | \n",
" -0.098572 | \n",
" 1.769086 | \n",
" -0.215027 | \n",
" -0.053076 | \n",
" A | \n",
"
\n",
" \n",
" 9979 | \n",
" -0.019058 | \n",
" 1.964994 | \n",
" 0.738538 | \n",
" -0.883776 | \n",
" F | \n",
"
\n",
" \n",
" 9980 | \n",
" -0.595349 | \n",
" 0.001781 | \n",
" -1.423355 | \n",
" -1.458477 | \n",
" M | \n",
"
\n",
" \n",
" 9981 | \n",
" 1.392170 | \n",
" -1.396560 | \n",
" -1.425306 | \n",
" -0.847535 | \n",
" H | \n",
"
\n",
" \n",
" 9982 | \n",
" -0.896029 | \n",
" -0.152287 | \n",
" 1.924483 | \n",
" 0.365184 | \n",
" 6 | \n",
"
\n",
" \n",
" 9983 | \n",
" -2.274642 | \n",
" -0.901874 | \n",
" 1.500352 | \n",
" 0.996541 | \n",
" N | \n",
"
\n",
" \n",
" 9984 | \n",
" -0.301898 | \n",
" 1.019906 | \n",
" 1.102160 | \n",
" 2.624526 | \n",
" I | \n",
"
\n",
" \n",
" 9985 | \n",
" -2.548389 | \n",
" -0.585374 | \n",
" 1.496201 | \n",
" -0.718815 | \n",
" D | \n",
"
\n",
" \n",
" 9986 | \n",
" -0.064588 | \n",
" 0.759292 | \n",
" -1.568415 | \n",
" -0.420933 | \n",
" E | \n",
"
\n",
" \n",
" 9987 | \n",
" -0.143365 | \n",
" -1.111760 | \n",
" -1.815581 | \n",
" 0.435274 | \n",
" 2 | \n",
"
\n",
" \n",
" 9988 | \n",
" -0.070412 | \n",
" -1.055921 | \n",
" 0.338017 | \n",
" -0.440763 | \n",
" X | \n",
"
\n",
" \n",
" 9989 | \n",
" 0.649148 | \n",
" 0.994273 | \n",
" -1.384227 | \n",
" 0.485120 | \n",
" Q | \n",
"
\n",
" \n",
" 9990 | \n",
" -0.370769 | \n",
" 0.404356 | \n",
" -1.051628 | \n",
" -1.050899 | \n",
" 8 | \n",
"
\n",
" \n",
" 9991 | \n",
" -0.409980 | \n",
" 0.155627 | \n",
" -0.818990 | \n",
" 1.277350 | \n",
" W | \n",
"
\n",
" \n",
" 9992 | \n",
" 0.301214 | \n",
" -1.111203 | \n",
" 0.668258 | \n",
" 0.671922 | \n",
" A | \n",
"
\n",
" \n",
" 9993 | \n",
" 1.821117 | \n",
" 0.416445 | \n",
" 0.173874 | \n",
" 0.505118 | \n",
" X | \n",
"
\n",
" \n",
" 9994 | \n",
" 0.068804 | \n",
" 1.322759 | \n",
" 0.802346 | \n",
" 0.223618 | \n",
" H | \n",
"
\n",
" \n",
" 9995 | \n",
" 2.311896 | \n",
" -0.417070 | \n",
" -1.409599 | \n",
" -0.515821 | \n",
" L | \n",
"
\n",
" \n",
" 9996 | \n",
" -0.479893 | \n",
" -0.650419 | \n",
" 0.745152 | \n",
" -0.646038 | \n",
" E | \n",
"
\n",
" \n",
" 9997 | \n",
" 0.523331 | \n",
" 0.787112 | \n",
" 0.486066 | \n",
" 1.093156 | \n",
" K | \n",
"
\n",
" \n",
" 9998 | \n",
" -0.362559 | \n",
" 0.598894 | \n",
" -1.843201 | \n",
" 0.887292 | \n",
" G | \n",
"
\n",
" \n",
" 9999 | \n",
" -0.096376 | \n",
" -1.012999 | \n",
" -0.657431 | \n",
" -0.573315 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
10000 rows \u00d7 5 columns
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 33,
"text": [
" one two three four key\n",
"0 0.467976 -0.038649 -0.295344 -1.824726 L\n",
"1 -0.358893 1.404453 0.704965 -0.200638 B\n",
"2 -0.501840 0.659254 -0.421691 -0.057688 G\n",
"3 0.204886 1.074134 1.388361 -0.982404 R\n",
"4 0.354628 -0.133116 0.283763 -0.837063 Q\n",
"5 1.817480 0.742273 0.419395 -2.251035 Q\n",
"6 -0.776764 0.935518 -0.332872 -1.875641 U\n",
"7 -0.913135 1.530624 -0.572657 0.477252 K\n",
"8 0.358480 -0.497572 -0.367016 0.507702 S\n",
"9 -1.740877 -1.160417 -1.637830 2.172201 G\n",
"10 0.240564 -0.328249 1.252155 1.072796 8\n",
"11 0.764018 1.165476 -0.639544 1.495258 R\n",
"12 0.571035 -0.310537 0.582437 -0.298765 1\n",
"13 2.317658 0.430710 -1.334216 0.199679 P\n",
"14 1.547771 -1.119753 -2.277634 0.329586 J\n",
"15 -1.310608 0.401719 -1.000987 1.156708 E\n",
"16 -0.088496 0.634712 0.153324 0.415335 B\n",
"17 -0.018663 -0.247487 -1.446522 0.750938 A\n",
"18 -0.070127 -1.579097 0.120892 0.671432 F\n",
"19 -0.194678 -0.492039 2.359605 0.319810 H\n",
"20 -0.248618 0.868707 -0.492226 -0.717959 W\n",
"21 -1.091549 -0.867110 -0.647760 -0.832562 C\n",
"22 0.641404 -0.138822 -0.621963 -0.284839 C\n",
"23 1.216408 0.992687 0.165162 -0.069619 V\n",
"24 -0.564474 0.792832 0.747053 0.571675 I\n",
"25 1.759879 -0.515666 -0.230481 1.362317 S\n",
"26 0.126266 0.309281 0.382820 -0.239199 L\n",
"27 1.334360 -0.100152 -0.840731 -0.643967 6\n",
"28 -0.737620 0.278087 -0.053235 -0.950972 J\n",
"29 -1.148486 -0.986292 -0.144963 0.124362 Y\n",
"... ... ... ... ... ..\n",
"9970 0.633495 -0.186524 0.927627 0.143164 4\n",
"9971 0.308636 -0.112857 0.762842 -1.072977 1\n",
"9972 -1.627051 -0.978151 0.154745 -1.229037 Z\n",
"9973 0.314847 0.097989 0.199608 0.955193 P\n",
"9974 1.666907 0.992005 0.496128 -0.686391 S\n",
"9975 0.010603 0.708540 -1.258711 0.226541 K\n",
"9976 0.118693 -0.714455 -0.501342 -0.254764 K\n",
"9977 0.302616 -2.011527 -0.628085 0.768827 H\n",
"9978 -0.098572 1.769086 -0.215027 -0.053076 A\n",
"9979 -0.019058 1.964994 0.738538 -0.883776 F\n",
"9980 -0.595349 0.001781 -1.423355 -1.458477 M\n",
"9981 1.392170 -1.396560 -1.425306 -0.847535 H\n",
"9982 -0.896029 -0.152287 1.924483 0.365184 6\n",
"9983 -2.274642 -0.901874 1.500352 0.996541 N\n",
"9984 -0.301898 1.019906 1.102160 2.624526 I\n",
"9985 -2.548389 -0.585374 1.496201 -0.718815 D\n",
"9986 -0.064588 0.759292 -1.568415 -0.420933 E\n",
"9987 -0.143365 -1.111760 -1.815581 0.435274 2\n",
"9988 -0.070412 -1.055921 0.338017 -0.440763 X\n",
"9989 0.649148 0.994273 -1.384227 0.485120 Q\n",
"9990 -0.370769 0.404356 -1.051628 -1.050899 8\n",
"9991 -0.409980 0.155627 -0.818990 1.277350 W\n",
"9992 0.301214 -1.111203 0.668258 0.671922 A\n",
"9993 1.821117 0.416445 0.173874 0.505118 X\n",
"9994 0.068804 1.322759 0.802346 0.223618 H\n",
"9995 2.311896 -0.417070 -1.409599 -0.515821 L\n",
"9996 -0.479893 -0.650419 0.745152 -0.646038 E\n",
"9997 0.523331 0.787112 0.486066 1.093156 K\n",
"9998 -0.362559 0.598894 -1.843201 0.887292 G\n",
"9999 -0.096376 -1.012999 -0.657431 -0.573315 0\n",
"\n",
"[10000 rows x 5 columns]"
]
}
],
"prompt_number": 33
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read out a small number of rows:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex6.csv', nrows = 5)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
" four | \n",
" key | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.467976 | \n",
" -0.038649 | \n",
" -0.295344 | \n",
" -1.824726 | \n",
" L | \n",
"
\n",
" \n",
" 1 | \n",
" -0.358893 | \n",
" 1.404453 | \n",
" 0.704965 | \n",
" -0.200638 | \n",
" B | \n",
"
\n",
" \n",
" 2 | \n",
" -0.501840 | \n",
" 0.659254 | \n",
" -0.421691 | \n",
" -0.057688 | \n",
" G | \n",
"
\n",
" \n",
" 3 | \n",
" 0.204886 | \n",
" 1.074134 | \n",
" 1.388361 | \n",
" -0.982404 | \n",
" R | \n",
"
\n",
" \n",
" 4 | \n",
" 0.354628 | \n",
" -0.133116 | \n",
" 0.283763 | \n",
" -0.837063 | \n",
" Q | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 34,
"text": [
" one two three four key\n",
"0 0.467976 -0.038649 -0.295344 -1.824726 L\n",
"1 -0.358893 1.404453 0.704965 -0.200638 B\n",
"2 -0.501840 0.659254 -0.421691 -0.057688 G\n",
"3 0.204886 1.074134 1.388361 -0.982404 R\n",
"4 0.354628 -0.133116 0.283763 -0.837063 Q"
]
}
],
"prompt_number": 34
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read out file in pieces:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"chunker = pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex6.csv', chunksize = 1000)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 35
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"chunker"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 36,
"text": [
""
]
}
],
"prompt_number": 36
},
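{
"cell_type": "markdown",
"metadata": {},
"source": [
"The TextFileReader can also be iterated over, processing one chunk at a time. As a sketch, aggregating value counts of a 'key' column over chunks (using a small made-up in-memory CSV so the reader above is left untouched):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from io import StringIO\n",
"\n",
"# Hypothetical stand-in data; chunksize=2 yields three two-row pieces\n",
"csv = 'key\\nL\\nB\\nL\\nG\\nB\\nL'\n",
"piece_reader = pd.read_csv(StringIO(csv), chunksize=2)\n",
"\n",
"tot = pd.Series([], dtype=float)\n",
"for piece in piece_reader:\n",
"    tot = tot.add(piece['key'].value_counts(), fill_value=0)\n",
"tot"
],
"language": "python",
"metadata": {},
"outputs": []
},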
{
"cell_type": "code",
"collapsed": false,
"input": [
"next(iter(chunker))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
" three | \n",
" four | \n",
" key | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.467976 | \n",
" -0.038649 | \n",
" -0.295344 | \n",
" -1.824726 | \n",
" L | \n",
"
\n",
" \n",
" 1 | \n",
" -0.358893 | \n",
" 1.404453 | \n",
" 0.704965 | \n",
" -0.200638 | \n",
" B | \n",
"
\n",
" \n",
" 2 | \n",
" -0.501840 | \n",
" 0.659254 | \n",
" -0.421691 | \n",
" -0.057688 | \n",
" G | \n",
"
\n",
" \n",
" 3 | \n",
" 0.204886 | \n",
" 1.074134 | \n",
" 1.388361 | \n",
" -0.982404 | \n",
" R | \n",
"
\n",
" \n",
" 4 | \n",
" 0.354628 | \n",
" -0.133116 | \n",
" 0.283763 | \n",
" -0.837063 | \n",
" Q | \n",
"
\n",
" \n",
" 5 | \n",
" 1.817480 | \n",
" 0.742273 | \n",
" 0.419395 | \n",
" -2.251035 | \n",
" Q | \n",
"
\n",
" \n",
" 6 | \n",
" -0.776764 | \n",
" 0.935518 | \n",
" -0.332872 | \n",
" -1.875641 | \n",
" U | \n",
"
\n",
" \n",
" 7 | \n",
" -0.913135 | \n",
" 1.530624 | \n",
" -0.572657 | \n",
" 0.477252 | \n",
" K | \n",
"
\n",
" \n",
" 8 | \n",
" 0.358480 | \n",
" -0.497572 | \n",
" -0.367016 | \n",
" 0.507702 | \n",
" S | \n",
"
\n",
" \n",
" 9 | \n",
" -1.740877 | \n",
" -1.160417 | \n",
" -1.637830 | \n",
" 2.172201 | \n",
" G | \n",
"
\n",
" \n",
" 10 | \n",
" 0.240564 | \n",
" -0.328249 | \n",
" 1.252155 | \n",
" 1.072796 | \n",
" 8 | \n",
"
\n",
" \n",
" 11 | \n",
" 0.764018 | \n",
" 1.165476 | \n",
" -0.639544 | \n",
" 1.495258 | \n",
" R | \n",
"
\n",
" \n",
" 12 | \n",
" 0.571035 | \n",
" -0.310537 | \n",
" 0.582437 | \n",
" -0.298765 | \n",
" 1 | \n",
"
\n",
" \n",
" 13 | \n",
" 2.317658 | \n",
" 0.430710 | \n",
" -1.334216 | \n",
" 0.199679 | \n",
" P | \n",
"
\n",
" \n",
" 14 | \n",
" 1.547771 | \n",
" -1.119753 | \n",
" -2.277634 | \n",
" 0.329586 | \n",
" J | \n",
"
\n",
" \n",
" 15 | \n",
" -1.310608 | \n",
" 0.401719 | \n",
" -1.000987 | \n",
" 1.156708 | \n",
" E | \n",
"
\n",
" \n",
" 16 | \n",
" -0.088496 | \n",
" 0.634712 | \n",
" 0.153324 | \n",
" 0.415335 | \n",
" B | \n",
"
\n",
" \n",
" 17 | \n",
" -0.018663 | \n",
" -0.247487 | \n",
" -1.446522 | \n",
" 0.750938 | \n",
" A | \n",
"
\n",
" \n",
" 18 | \n",
" -0.070127 | \n",
" -1.579097 | \n",
" 0.120892 | \n",
" 0.671432 | \n",
" F | \n",
"
\n",
" \n",
" 19 | \n",
" -0.194678 | \n",
" -0.492039 | \n",
" 2.359605 | \n",
" 0.319810 | \n",
" H | \n",
"
\n",
" \n",
" 20 | \n",
" -0.248618 | \n",
" 0.868707 | \n",
" -0.492226 | \n",
" -0.717959 | \n",
" W | \n",
"
\n",
" \n",
" 21 | \n",
" -1.091549 | \n",
" -0.867110 | \n",
" -0.647760 | \n",
" -0.832562 | \n",
" C | \n",
"
\n",
" \n",
" 22 | \n",
" 0.641404 | \n",
" -0.138822 | \n",
" -0.621963 | \n",
" -0.284839 | \n",
" C | \n",
"
\n",
" \n",
" 23 | \n",
" 1.216408 | \n",
" 0.992687 | \n",
" 0.165162 | \n",
" -0.069619 | \n",
" V | \n",
"
\n",
" \n",
" 24 | \n",
" -0.564474 | \n",
" 0.792832 | \n",
" 0.747053 | \n",
" 0.571675 | \n",
" I | \n",
"
\n",
" \n",
" 25 | \n",
" 1.759879 | \n",
" -0.515666 | \n",
" -0.230481 | \n",
" 1.362317 | \n",
" S | \n",
"
\n",
" \n",
" 26 | \n",
" 0.126266 | \n",
" 0.309281 | \n",
" 0.382820 | \n",
" -0.239199 | \n",
" L | \n",
"
\n",
" \n",
" 27 | \n",
" 1.334360 | \n",
" -0.100152 | \n",
" -0.840731 | \n",
" -0.643967 | \n",
" 6 | \n",
"
\n",
" \n",
" 28 | \n",
" -0.737620 | \n",
" 0.278087 | \n",
" -0.053235 | \n",
" -0.950972 | \n",
" J | \n",
"
\n",
" \n",
" 29 | \n",
" -1.148486 | \n",
" -0.986292 | \n",
" -0.144963 | \n",
" 0.124362 | \n",
" Y | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 970 | \n",
" 0.633495 | \n",
" -0.186524 | \n",
" 0.927627 | \n",
" 0.143164 | \n",
" P | \n",
"
\n",
" \n",
" 971 | \n",
" 0.308636 | \n",
" -0.112857 | \n",
" 0.762842 | \n",
" -1.072977 | \n",
" X | \n",
"
\n",
" \n",
" 972 | \n",
" -1.627051 | \n",
" -0.978151 | \n",
" 0.154745 | \n",
" -1.229037 | \n",
" O | \n",
"
\n",
" \n",
" 973 | \n",
" 0.314847 | \n",
" 0.097989 | \n",
" 0.199608 | \n",
" 0.955193 | \n",
" R | \n",
"
\n",
" \n",
" 974 | \n",
" 1.666907 | \n",
" 0.992005 | \n",
" 0.496128 | \n",
" -0.686391 | \n",
" W | \n",
"
\n",
" \n",
" 975 | \n",
" 0.010603 | \n",
" 0.708540 | \n",
" -1.258711 | \n",
" 0.226541 | \n",
" O | \n",
"
\n",
" \n",
" 976 | \n",
" 0.118693 | \n",
" -0.714455 | \n",
" -0.501342 | \n",
" -0.254764 | \n",
" S | \n",
"
\n",
" \n",
" 977 | \n",
" 0.302616 | \n",
" -2.011527 | \n",
" -0.628085 | \n",
" 0.768827 | \n",
" R | \n",
"
\n",
" \n",
" 978 | \n",
" -0.098572 | \n",
" 1.769086 | \n",
" -0.215027 | \n",
" -0.053076 | \n",
" G | \n",
"
\n",
" \n",
" 979 | \n",
" -0.019058 | \n",
" 1.964994 | \n",
" 0.738538 | \n",
" -0.883776 | \n",
" X | \n",
"
\n",
" \n",
" 980 | \n",
" -0.595349 | \n",
" 0.001781 | \n",
" -1.423355 | \n",
" -1.458477 | \n",
" O | \n",
"
\n",
" \n",
" 981 | \n",
" 1.392170 | \n",
" -1.396560 | \n",
" -1.425306 | \n",
" -0.847535 | \n",
" K | \n",
"
\n",
" \n",
" 982 | \n",
" -0.896029 | \n",
" -0.152287 | \n",
" 1.924483 | \n",
" 0.365184 | \n",
" M | \n",
"
\n",
" \n",
" 983 | \n",
" -2.274642 | \n",
" -0.901874 | \n",
" 1.500352 | \n",
" 0.996541 | \n",
" Q | \n",
"
\n",
" \n",
" 984 | \n",
" -0.301898 | \n",
" 1.019906 | \n",
" 1.102160 | \n",
" 2.624526 | \n",
" Q | \n",
"
\n",
" \n",
" 985 | \n",
" -2.548389 | \n",
" -0.585374 | \n",
" 1.496201 | \n",
" -0.718815 | \n",
" R | \n",
"
\n",
" \n",
" 986 | \n",
" -0.064588 | \n",
" 0.759292 | \n",
" -1.568415 | \n",
" -0.420933 | \n",
" N | \n",
"
\n",
" \n",
" 987 | \n",
" -0.143365 | \n",
" -1.111760 | \n",
" -1.815581 | \n",
" 0.435274 | \n",
" C | \n",
"
\n",
" \n",
" 988 | \n",
" -0.070412 | \n",
" -1.055921 | \n",
" 0.338017 | \n",
" -0.440763 | \n",
" X | \n",
"
\n",
" \n",
" 989 | \n",
" 0.649148 | \n",
" 0.994273 | \n",
" -1.384227 | \n",
" 0.485120 | \n",
" I | \n",
"
\n",
" \n",
" 990 | \n",
" -0.370769 | \n",
" 0.404356 | \n",
" -1.051628 | \n",
" -1.050899 | \n",
" V | \n",
"
\n",
" \n",
" 991 | \n",
" -0.409980 | \n",
" 0.155627 | \n",
" -0.818990 | \n",
" 1.277350 | \n",
" 4 | \n",
"
\n",
" \n",
" 992 | \n",
" 0.301214 | \n",
" -1.111203 | \n",
" 0.668258 | \n",
" 0.671922 | \n",
" Q | \n",
"
\n",
" \n",
" 993 | \n",
" 1.821117 | \n",
" 0.416445 | \n",
" 0.173874 | \n",
" 0.505118 | \n",
" O | \n",
"
\n",
" \n",
" 994 | \n",
" 0.068804 | \n",
" 1.322759 | \n",
" 0.802346 | \n",
" 0.223618 | \n",
" W | \n",
"
\n",
" \n",
" 995 | \n",
" 2.311896 | \n",
" -0.417070 | \n",
" -1.409599 | \n",
" -0.515821 | \n",
" M | \n",
"
\n",
" \n",
" 996 | \n",
" -0.479893 | \n",
" -0.650419 | \n",
" 0.745152 | \n",
" -0.646038 | \n",
" H | \n",
"
\n",
" \n",
" 997 | \n",
" 0.523331 | \n",
" 0.787112 | \n",
" 0.486066 | \n",
" 1.093156 | \n",
" D | \n",
"
\n",
" \n",
" 998 | \n",
" -0.362559 | \n",
" 0.598894 | \n",
" -1.843201 | \n",
" 0.887292 | \n",
" W | \n",
"
\n",
" \n",
" 999 | \n",
" -0.096376 | \n",
" -1.012999 | \n",
" -0.657431 | \n",
" -0.573315 | \n",
" K | \n",
"
\n",
" \n",
"
\n",
"
1000 rows \u00d7 5 columns
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 38,
"text": [
" one two three four key\n",
"0 0.467976 -0.038649 -0.295344 -1.824726 L\n",
"1 -0.358893 1.404453 0.704965 -0.200638 B\n",
"2 -0.501840 0.659254 -0.421691 -0.057688 G\n",
"3 0.204886 1.074134 1.388361 -0.982404 R\n",
"4 0.354628 -0.133116 0.283763 -0.837063 Q\n",
"5 1.817480 0.742273 0.419395 -2.251035 Q\n",
"6 -0.776764 0.935518 -0.332872 -1.875641 U\n",
"7 -0.913135 1.530624 -0.572657 0.477252 K\n",
"8 0.358480 -0.497572 -0.367016 0.507702 S\n",
"9 -1.740877 -1.160417 -1.637830 2.172201 G\n",
"10 0.240564 -0.328249 1.252155 1.072796 8\n",
"11 0.764018 1.165476 -0.639544 1.495258 R\n",
"12 0.571035 -0.310537 0.582437 -0.298765 1\n",
"13 2.317658 0.430710 -1.334216 0.199679 P\n",
"14 1.547771 -1.119753 -2.277634 0.329586 J\n",
"15 -1.310608 0.401719 -1.000987 1.156708 E\n",
"16 -0.088496 0.634712 0.153324 0.415335 B\n",
"17 -0.018663 -0.247487 -1.446522 0.750938 A\n",
"18 -0.070127 -1.579097 0.120892 0.671432 F\n",
"19 -0.194678 -0.492039 2.359605 0.319810 H\n",
"20 -0.248618 0.868707 -0.492226 -0.717959 W\n",
"21 -1.091549 -0.867110 -0.647760 -0.832562 C\n",
"22 0.641404 -0.138822 -0.621963 -0.284839 C\n",
"23 1.216408 0.992687 0.165162 -0.069619 V\n",
"24 -0.564474 0.792832 0.747053 0.571675 I\n",
"25 1.759879 -0.515666 -0.230481 1.362317 S\n",
"26 0.126266 0.309281 0.382820 -0.239199 L\n",
"27 1.334360 -0.100152 -0.840731 -0.643967 6\n",
"28 -0.737620 0.278087 -0.053235 -0.950972 J\n",
"29 -1.148486 -0.986292 -0.144963 0.124362 Y\n",
".. ... ... ... ... ..\n",
"970 0.633495 -0.186524 0.927627 0.143164 P\n",
"971 0.308636 -0.112857 0.762842 -1.072977 X\n",
"972 -1.627051 -0.978151 0.154745 -1.229037 O\n",
"973 0.314847 0.097989 0.199608 0.955193 R\n",
"974 1.666907 0.992005 0.496128 -0.686391 W\n",
"975 0.010603 0.708540 -1.258711 0.226541 O\n",
"976 0.118693 -0.714455 -0.501342 -0.254764 S\n",
"977 0.302616 -2.011527 -0.628085 0.768827 R\n",
"978 -0.098572 1.769086 -0.215027 -0.053076 G\n",
"979 -0.019058 1.964994 0.738538 -0.883776 X\n",
"980 -0.595349 0.001781 -1.423355 -1.458477 O\n",
"981 1.392170 -1.396560 -1.425306 -0.847535 K\n",
"982 -0.896029 -0.152287 1.924483 0.365184 M\n",
"983 -2.274642 -0.901874 1.500352 0.996541 Q\n",
"984 -0.301898 1.019906 1.102160 2.624526 Q\n",
"985 -2.548389 -0.585374 1.496201 -0.718815 R\n",
"986 -0.064588 0.759292 -1.568415 -0.420933 N\n",
"987 -0.143365 -1.111760 -1.815581 0.435274 C\n",
"988 -0.070412 -1.055921 0.338017 -0.440763 X\n",
"989 0.649148 0.994273 -1.384227 0.485120 I\n",
"990 -0.370769 0.404356 -1.051628 -1.050899 V\n",
"991 -0.409980 0.155627 -0.818990 1.277350 4\n",
"992 0.301214 -1.111203 0.668258 0.671922 Q\n",
"993 1.821117 0.416445 0.173874 0.505118 O\n",
"994 0.068804 1.322759 0.802346 0.223618 W\n",
"995 2.311896 -0.417070 -1.409599 -0.515821 M\n",
"996 -0.479893 -0.650419 0.745152 -0.646038 H\n",
"997 0.523331 0.787112 0.486066 1.093156 D\n",
"998 -0.362559 0.598894 -1.843201 0.887292 W\n",
"999 -0.096376 -1.012999 -0.657431 -0.573315 K\n",
"\n",
"[1000 rows x 5 columns]"
]
}
],
"prompt_number": 38
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The TextParser object returned by read_csv allows you to iterate over the parts of the file according to the chunksize. For example, we can iterate over ex6.csv, aggregating the value counts in the 'key' column like so:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"tot = pd.Series([])\n",
"for piece in chunker:\n",
" tot = tot.add(piece['key'].value_counts(), fill_value = 0)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 41
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"tot = tot.order(ascending = False)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 42
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"tot[:10]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 43,
"text": [
"E 336\n",
"X 327\n",
"L 315\n",
"M 309\n",
"K 303\n",
"Q 301\n",
"O 299\n",
"P 299\n",
"J 298\n",
"F 295\n",
"dtype: float64"
]
}
],
"prompt_number": 43
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Writing Data Out to Text Format"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data can be exported to delimited format."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"data = pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex5.csv')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 47
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"data"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" something | \n",
" a | \n",
" b | \n",
" c | \n",
" d | \n",
" message | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" one | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" two | \n",
" 5 | \n",
" 6 | \n",
" NaN | \n",
" 8 | \n",
" world | \n",
"
\n",
" \n",
" 2 | \n",
" three | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
" 12 | \n",
" foo | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 48,
"text": [
" something a b c d message\n",
"0 one 1 2 3 4 NaN\n",
"1 two 5 6 NaN 8 world\n",
"2 three 9 10 11 12 foo"
]
}
],
"prompt_number": 48
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can write data out to a comma-separated file:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"data.to_csv('/Users/sergulaydore/pydata-book/ch06/out.csv')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 49
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!cat /Users/sergulaydore/pydata-book/ch06/out.csv"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
",something,a,b,c,d,message\r\n",
"0,one,1,2,3.0,4,\r\n",
"1,two,5,6,,8,world\r\n",
"2,three,9,10,11.0,12,foo\r\n"
]
}
],
"prompt_number": 50
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Other delimiters can be used, of course (writing to sys.stdout so it just prints the text result)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import sys\n",
"data.to_csv(sys.stdout, sep = '|')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"|something|a|b|c|d|message\n",
"0|one|1|2|3.0|4|\n",
"1|two|5|6||8|world\n",
"2|three|9|10|11.0|12|foo\n"
]
}
],
"prompt_number": 52
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Missing values appear as empty strings in the output. You might want to denote them by some other sentinel value:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"data.to_csv(sys.stdout, na_rep = 'NULL')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
",something,a,b,c,d,message\n",
"0,one,1,2,3.0,4,NULL\n",
"1,two,5,6,NULL,8,world\n",
"2,three,9,10,11.0,12,foo\n"
]
}
],
"prompt_number": 53
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With no other options specified, both the row and column labels are written. Both of these can be disabled:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"data.to_csv(sys.stdout, index=False, header = False)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"one,1,2,3.0,4,\n",
"two,5,6,,8,world\n",
"three,9,10,11.0,12,foo\n"
]
}
],
"prompt_number": 54
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also write only a subset of the columns, and in an order of your choosing:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"data.to_csv(sys.stdout, index = False, columns = ['a', 'b', 'c'])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"a,b,c\n",
"1,2,3.0\n",
"5,6,\n",
"9,10,11.0\n"
]
}
],
"prompt_number": 56
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Series also has a to_csv method:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"dates = pd.date_range('1/1/2000', periods=7)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 57
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"ts = pd.Series(np.arange(7), index=dates)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 60
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"ts.to_csv('/Users/sergulaydore/pydata-book/ch06/tseries.csv')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 61
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!cat /Users/sergulaydore/pydata-book/ch06/tseries.csv"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"2000-01-01,0\r\n",
"2000-01-02,1\r\n",
"2000-01-03,2\r\n",
"2000-01-04,3\r\n",
"2000-01-05,4\r\n",
"2000-01-06,5\r\n",
"2000-01-07,6\r\n"
]
}
],
"prompt_number": 62
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can read a CSV version of Series with read_csv, but there is also a from_csv convenience method that makes it a bit simpler:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"pd.Series.from_csv('/Users/sergulaydore/pydata-book/ch06/tseries.csv', parse_dates = True)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 68,
"text": [
"2000-01-01 0\n",
"2000-01-02 1\n",
"2000-01-03 2\n",
"2000-01-04 3\n",
"2000-01-05 4\n",
"2000-01-06 5\n",
"2000-01-07 6\n",
"dtype: int64"
]
}
],
"prompt_number": 68
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"help(pd.Series.from_csv)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Help on method from_csv in module pandas.core.series:\n",
"\n",
"from_csv(cls, path, sep=',', parse_dates=True, header=None, index_col=0, encoding=None, infer_datetime_format=False) method of __builtin__.type instance\n",
" Read delimited file into Series\n",
" \n",
" Parameters\n",
" ----------\n",
" path : string file path or file handle / StringIO\n",
" sep : string, default ','\n",
" Field delimiter\n",
" parse_dates : boolean, default True\n",
" Parse dates. Different default from read_table\n",
" header : int, default 0\n",
" Row to use at header (skip prior rows)\n",
" index_col : int or sequence, default 0\n",
" Column to use for index. If a sequence is given, a MultiIndex\n",
" is used. Different default from read_table\n",
" encoding : string, optional\n",
" a string representing the encoding to use if the contents are\n",
" non-ascii, for python versions prior to 3\n",
" infer_datetime_format: boolean, default False\n",
" If True and `parse_dates` is True for a column, try to infer the\n",
" datetime format based on the first datetime string. If the format\n",
" can be inferred, there often will be a large parsing speed-up.\n",
" \n",
" Returns\n",
" -------\n",
" y : Series\n",
"\n"
]
}
],
"prompt_number": 66
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Manually Working with Delimited Formats"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!cat /Users/sergulaydore/pydata-book/ch06/ex7.csv"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\"a\",\"b\",\"c\"\r\n",
"\"1\",\"2\",\"3\"\r\n",
"\"1\",\"2\",\"3\",\"4\"\r\n"
]
}
],
"prompt_number": 69
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For any file with a single-character delimiter, you can use Python's built-in csv module. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import csv\n",
"f = open('/Users/sergulaydore/pydata-book/ch06/ex7.csv')\n",
"reader = csv.reader(f)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 72
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Iterating through the reader like a file yields tuples of values in each like with any quote characters removed:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for line in reader:\n",
" print line"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"['a', 'b', 'c']\n",
"['1', '2', '3']\n",
"['1', '2', '3', '4']\n"
]
}
],
"prompt_number": 73
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From there, it's up to you to do the wrangling necessary to put the data in the form that you need it. For example:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"lines = list(csv.reader(open('/Users/sergulaydore/pydata-book/ch06/ex7.csv')))"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 76
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"lines"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 77,
"text": [
"[['a', 'b', 'c'], ['1', '2', '3'], ['1', '2', '3', '4']]"
]
}
],
"prompt_number": 77
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"header, values = lines[0], lines[1:]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 78
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"header"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 79,
"text": [
"['a', 'b', 'c']"
]
}
],
"prompt_number": 79
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"values"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 80,
"text": [
"[['1', '2', '3'], ['1', '2', '3', '4']]"
]
}
],
"prompt_number": 80
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"data_dict = {h: v for h, v in zip(header, zip(*values))}"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 83
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"data_dict"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 84,
"text": [
"{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}"
]
}
],
"prompt_number": 84
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"CSV files come in many different flavors. Defining a new format with a differnt delimiter, string quoting convention, or line terminator is done by defining a simple sub-class of csv.Dialect:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"class my_dialect(csv.Dialect):\n",
" lineterminator = '\\n'\n",
" delimiter = ';'\n",
" quotechar = '\"'\n",
" quoting=1"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 90
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"reader = csv.reader(f, dialect = my_dialect)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 91
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Individual CSV dialect parameters can also be given as keywords to csv.reader without having to define a subclass:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"reader = csv.reader(f, delimiter='|')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 98
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To write delimited files manually, you can use csv.writer."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"with open('mydata.csv','w') as f:\n",
" writer = csv.writer(f, dialect=my_dialect)\n",
" writer.writerow(('one', 'two', 'three'))\n",
" writer.writerow(('1','2','3'))\n",
" writer.writerow(('4','5','6'))\n",
" writer.writerow(('7','8','9'))"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 99
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!cat mydata.csv"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\"one\";\"two\";\"three\"\r\n",
"\"1\";\"2\";\"3\"\r\n",
"\"4\";\"5\";\"6\"\r\n",
"\"7\";\"8\";\"9\"\r\n"
]
}
],
"prompt_number": 100
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"JSON data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"JSON (Short for JavaScript Object notation) has become one of the standard formats for sending data by HTTP request between web browsers and other applications. It is a much more flexible data format than a tabular text form like CSV. Here is an example."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"obj = \"\"\"\n",
"{\"name\": \"Wes\",\n",
"\"places_lived\": [\"United States\", \"Spain\", \"Germany\"],\n",
"\"pet\": null,\n",
"\"siblings\":[{\"name\": \"Scott\", \"age\": 25, \"pet\":\"Zuko\"},\n",
" {\"name\": \"Katie\", \"age\": 33, \"pet\": \"Cisco\"}]\n",
"}\n",
"\"\"\""
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 104
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To convert JSON string to Python form, use json.loads:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import json"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 105
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"result = json.loads(obj)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 106
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"result"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 107,
"text": [
"{u'name': u'Wes',\n",
" u'pet': None,\n",
" u'places_lived': [u'United States', u'Spain', u'Germany'],\n",
" u'siblings': [{u'age': 25, u'name': u'Scott', u'pet': u'Zuko'},\n",
" {u'age': 33, u'name': u'Katie', u'pet': u'Cisco'}]}"
]
}
],
"prompt_number": 107
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"json.dumps converts a Python object back to JSON:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"asjson = json.dumps(result)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 108
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can pass a list of JSON objects to the DataFrame constructor and select a subset of the data fields:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"siblings = pd.DataFrame(result['siblings'], columns = ['name','age'])"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 109
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"siblings"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" age | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Scott | \n",
" 25 | \n",
"
\n",
" \n",
" 1 | \n",
" Katie | \n",
" 33 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 110,
"text": [
" name age\n",
"0 Scott 25\n",
"1 Katie 33"
]
}
],
"prompt_number": 110
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"XML and HTML: Web Scraping # THIS PART FAILED WHEN ATTEMPTED TO READ tables[9] and tables[13]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get started, find the URL you want to extract data from, open it with urllib2 and parse the stream with lxml like so:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from lxml.html import parse\n",
"from urllib2 import urlopen"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 111
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"help(parse)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Help on function parse in module lxml.html:\n",
"\n",
"parse(filename_or_url, parser=None, base_url=None, **kw)\n",
" Parse a filename, URL, or file-like object into an HTML document\n",
" tree. Note: this returns a tree, not an element. Use\n",
" ``parse(...).getroot()`` to get the document root.\n",
" \n",
" You can override the base URL with the ``base_url`` keyword. This\n",
" is most useful when parsing from a file-like object.\n",
"\n"
]
}
],
"prompt_number": 112
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"parsed = parse((urlopen('http://finance.yahoo.com/q/op?s=AAPL+Options')))"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 113
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"doc = parsed.getroot()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 117
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"doc"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 118,
"text": [
""
]
}
],
"prompt_number": 118
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using this object, you can extract all HTML tags of a particular type, such as table tags containing the data of interest. As a simple example, suppose you wanted to get a list of every URL linked to in the document; links are a tags in HTML. Using the document root's findall method along with an XPath (a means of expressing \"queries\" on the document):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"links = doc.findall('.//a')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 119
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"links[15:20]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 121,
"text": [
"[,\n",
" ,\n",
" ,\n",
" ,\n",
" ]"
]
}
],
"prompt_number": 121
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But these are objects representing HTML elements; to get the URL and link text you have to use each element's get method (for the URL) and text content methos (for the display text)"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"lnk = links[28]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 122
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"lnk.get('href')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 123,
"text": [
"'https://homes.yahoo.com/own-rent/'"
]
}
],
"prompt_number": 123
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"lnk.text_content()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 124,
"text": [
"'Homes'"
]
}
],
"prompt_number": 124
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Thus, getting a list of all URLs in the document is a matter of writing this list comprehension:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"urls = [lnk.get('href') for lnk in doc.findall('.//a')]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 125
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"urls[10:]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 127,
"text": [
"['https://www.flickr.com/',\n",
" 'https://mobile.yahoo.com/',\n",
" 'http://everything.yahoo.com/',\n",
" 'https://celebrity.yahoo.com/',\n",
" 'https://www.yahoo.com/movies',\n",
" 'https://www.yahoo.com/music',\n",
" 'https://www.yahoo.com/tv',\n",
" 'https://groups.yahoo.com/',\n",
" 'https://www.yahoo.com/health',\n",
" 'https://www.yahoo.com/style',\n",
" 'https://www.yahoo.com/beauty',\n",
" 'https://www.yahoo.com/food',\n",
" 'https://www.yahoo.com/parenting',\n",
" 'https://www.yahoo.com/makers',\n",
" 'https://www.yahoo.com/tech',\n",
" 'http://shopping.yahoo.com/',\n",
" 'https://www.yahoo.com/travel',\n",
" 'https://autos.yahoo.com/',\n",
" 'https://homes.yahoo.com/own-rent/',\n",
" 'https://www.mozilla.org/firefox/new/?utm_source=yahoo&utm_medium=referral&utm_campaign=y-uh&utm_content=y-install-new-firefox',\n",
" 'http://finance.yahoo.com',\n",
" 'https://login.yahoo.com/config/login?.src=quote&.intl=us&.lang=en-US&.done=http://finance.yahoo.com/q/op%3fs=AAPL%2520Options',\n",
" 'https://login.yahoo.com/config/login?.src=quote&.intl=us&.lang=en-US&.done=http://finance.yahoo.com/q/op%3fs=AAPL%2520Options',\n",
" 'https://mail.yahoo.com/?.intl=us&.lang=en-US&.src=ym',\n",
" 'javascript:void(0);',\n",
" 'https://edit.yahoo.com/mc2.0/eval_profile?.intl=us&.lang=en-US&.done=http://finance.yahoo.com/q/op%3fs=AAPL%2520Options&.src=quote&.intl=us&.lang=en-US',\n",
" 'https://help.yahoo.com/l/us/yahoo/finance/',\n",
" 'http://feedback.yahoo.com/forums/207809',\n",
" 'https://www.yahoo.com/',\n",
" '/',\n",
" '/portfolios.html',\n",
" '/portfolios/manage',\n",
" '/portfolio/new',\n",
" '/my-quotes-news/',\n",
" '/market-overview/',\n",
" '/stock-center/',\n",
" '/funds/',\n",
" '/options/',\n",
" '/etf/',\n",
" '/bonds',\n",
" '/futures',\n",
" '/currency-investing',\n",
" 'http://biz.yahoo.com/research/earncal/today.html',\n",
" '/yahoofinance/',\n",
" '/yahoofinance/business/',\n",
" '/yahoofinance/investing',\n",
" '/yahoofinance/personalfinance',\n",
" '/blogs/breakout/',\n",
" '/blogs/cost-of-living/',\n",
" '/blogs/daily-ticker/',\n",
" '/blogs/driven/',\n",
" '/blogs/hot-stock-minute/',\n",
" '/blogs/just-explain-it/',\n",
" 'http://finance.yahoo.com/blogs/author/aaron-task/',\n",
" '/blogs/author/michael-santoli/',\n",
" '/blogs/author/jeff-macke/',\n",
" '/blogs/author/aaron-pressman/',\n",
" '/blogs/author/rick-newman/',\n",
" '/blogs/author/mandi-woodruff/',\n",
" '/blogs/author/chris-nichols/',\n",
" '/blogs/the-exchange/',\n",
" '/blogs/michael-santoli/',\n",
" 'http://finance.yahoo.com/blogs/author/philip-pearlman/',\n",
" '/news/',\n",
" '/corporate-news/',\n",
" '/economic-policy-news/',\n",
" '/investing-news/',\n",
" '/personal-finance/',\n",
" '/career-education/',\n",
" '/real-estate/',\n",
" '/retirement/',\n",
" '/credit-debt/',\n",
" '/taxes/',\n",
" '/autos/',\n",
" '/lifestyle/',\n",
" '/videos/',\n",
" '/rates/',\n",
" '/calculator/index/',\n",
" '/personal-finance/tools/',\n",
" '/cnbc/',\n",
" '/blogs/big-data-download/',\n",
" '/blogs/off-the-cuff/',\n",
" '/blogs/power-pitch/',\n",
" '/blogs/talking-numbers/',\n",
" '/blogs/the-biz-fix/',\n",
" '/blogs/top-best-most/',\n",
" '/contributors/',\n",
" 'http://finance.search.yahoo.com?fr=fin-v1',\n",
" '/q?s=^DJI',\n",
" '/q?s=^IXIC',\n",
" '/q?s=AAPL',\n",
" '/q/ecn?s=AAPL+Order+Book',\n",
" '/q/op?s=AAPL+Options',\n",
" '/q/hp?s=AAPL+Historical+Prices',\n",
" '/echarts?s=AAPL+Interactive',\n",
" '/q/h?s=AAPL+Headlines',\n",
" '/q/p?s=AAPL+Press+Releases',\n",
" '/q/ce?s=AAPL+Company+Events',\n",
" '/mb?s=AAPL',\n",
" '/marketpulse/?s=AAPL',\n",
" '/q/pr?s=AAPL+Profile',\n",
" '/q/ks?s=AAPL+Key+Statistics',\n",
" '/q/sec?s=AAPL+SEC+Filings',\n",
" '/q/co?s=AAPL+Competitors',\n",
" '/q/in?s=AAPL+Industry',\n",
" '/q/ct?s=AAPL+Components',\n",
" '/q/ao?s=AAPL+Analyst+Opinion',\n",
" '/q/ae?s=AAPL+Analyst+Estimates',\n",
" '/q/mh?s=AAPL+Major+Holders',\n",
" '/q/it?s=AAPL+Insider+Transactions',\n",
" '/q/ir?s=AAPL+Insider+Roster',\n",
" '/q/is?s=AAPL+Income+Statement',\n",
" '/q/bs?s=AAPL+Balance+Sheet',\n",
" '/q/cf?s=AAPL+Cash+Flow',\n",
" 'https://mobile.yahoo.com/finance/?src=gta',\n",
" '/q/op?s=AAPL&date=1426204800',\n",
" '/q/op?s=AAPL&straddle=true&date=1426204800',\n",
" None,\n",
" None,\n",
" '/q/op?s=AAPL&strike=85.00',\n",
" '/q?s=AAPL150313C00085000',\n",
" '/q/op?s=AAPL&strike=90.00',\n",
" '/q?s=AAPL150313C00090000',\n",
" '/q/op?s=AAPL&strike=95.00',\n",
" '/q?s=AAPL150313C00095000',\n",
" '/q/op?s=AAPL&strike=100.00',\n",
" '/q?s=AAPL150313C00100000',\n",
" '/q/op?s=AAPL&strike=101.00',\n",
" '/q?s=AAPL150313C00101000',\n",
" '/q/op?s=AAPL&strike=102.00',\n",
" '/q?s=AAPL150313C00102000',\n",
" '/q/op?s=AAPL&strike=103.00',\n",
" '/q?s=AAPL150313C00103000',\n",
" '/q/op?s=AAPL&strike=105.00',\n",
" '/q?s=AAPL150313C00105000',\n",
" '/q/op?s=AAPL&strike=106.00',\n",
" '/q?s=AAPL150313C00106000',\n",
" '/q/op?s=AAPL&strike=107.00',\n",
" '/q?s=AAPL150313C00107000',\n",
" '/q/op?s=AAPL&strike=108.00',\n",
" '/q?s=AAPL150313C00108000',\n",
" '/q/op?s=AAPL&strike=109.00',\n",
" '/q?s=AAPL150313C00109000',\n",
" '/q/op?s=AAPL&strike=110.00',\n",
" '/q?s=AAPL150313C00110000',\n",
" '/q/op?s=AAPL&strike=111.00',\n",
" '/q?s=AAPL150313C00111000',\n",
" '/q/op?s=AAPL&strike=112.00',\n",
" '/q?s=AAPL150313C00112000',\n",
" '/q/op?s=AAPL&strike=113.00',\n",
" '/q?s=AAPL150313C00113000',\n",
" '/q/op?s=AAPL&strike=114.00',\n",
" '/q?s=AAPL150313C00114000',\n",
" '/q/op?s=AAPL&strike=115.00',\n",
" '/q?s=AAPL150313C00115000',\n",
" '/q/op?s=AAPL&strike=116.00',\n",
" '/q?s=AAPL150313C00116000',\n",
" '/q/op?s=AAPL&strike=117.00',\n",
" '/q?s=AAPL150313C00117000',\n",
" '/q/op?s=AAPL&strike=118.00',\n",
" '/q?s=AAPL150313C00118000',\n",
" '/q/op?s=AAPL&strike=119.00',\n",
" '/q?s=AAPL150313C00119000',\n",
" '/q/op?s=AAPL&strike=120.00',\n",
" '/q?s=AAPL150313C00120000',\n",
" '/q/op?s=AAPL&strike=121.00',\n",
" '/q?s=AAPL150313C00121000',\n",
" '/q/op?s=AAPL&strike=122.00',\n",
" '/q?s=AAPL150313C00122000',\n",
" '/q/op?s=AAPL&strike=123.00',\n",
" '/q?s=AAPL150313C00123000',\n",
" '/q/op?s=AAPL&strike=124.00',\n",
" '/q?s=AAPL150313C00124000',\n",
" '/q/op?s=AAPL&strike=125.00',\n",
" '/q?s=AAPL150313C00125000',\n",
" '/q/op?s=AAPL&strike=126.00',\n",
" '/q?s=AAPL150313C00126000',\n",
" '/q/op?s=AAPL&strike=127.00',\n",
" '/q?s=AAPL150313C00127000',\n",
" '/q/op?s=AAPL&strike=128.00',\n",
" '/q?s=AAPL150313C00128000',\n",
" '/q/op?s=AAPL&strike=129.00',\n",
" '/q?s=AAPL150313C00129000',\n",
" '/q/op?s=AAPL&strike=130.00',\n",
" '/q?s=AAPL150313C00130000',\n",
" '/q/op?s=AAPL&strike=131.00',\n",
" '/q?s=AAPL150313C00131000',\n",
" '/q/op?s=AAPL&strike=132.00',\n",
" '/q?s=AAPL150313C00132000',\n",
" '/q/op?s=AAPL&strike=133.00',\n",
" '/q?s=AAPL150313C00133000',\n",
" '/q/op?s=AAPL&strike=134.00',\n",
" '/q?s=AAPL150313C00134000',\n",
" '/q/op?s=AAPL&strike=135.00',\n",
" '/q?s=AAPL150313C00135000',\n",
" '/q/op?s=AAPL&strike=136.00',\n",
" '/q?s=AAPL150313C00136000',\n",
" '/q/op?s=AAPL&strike=137.00',\n",
" '/q?s=AAPL150313C00137000',\n",
" '/q/op?s=AAPL&strike=138.00',\n",
" '/q?s=AAPL150313C00138000',\n",
" '/q/op?s=AAPL&strike=139.00',\n",
" '/q?s=AAPL150313C00139000',\n",
" '/q/op?s=AAPL&strike=140.00',\n",
" '/q?s=AAPL150313C00140000',\n",
" '/q/op?s=AAPL&strike=141.00',\n",
" '/q?s=AAPL150313C00141000',\n",
" '/q/op?s=AAPL&strike=142.00',\n",
" '/q?s=AAPL150313C00142000',\n",
" '/q/op?s=AAPL&strike=143.00',\n",
" '/q?s=AAPL150313C00143000',\n",
" '/q/op?s=AAPL&strike=144.00',\n",
" '/q?s=AAPL150313C00144000',\n",
" '/q/op?s=AAPL&strike=145.00',\n",
" '/q?s=AAPL150313C00145000',\n",
" '/q/op?s=AAPL&strike=146.00',\n",
" '/q?s=AAPL150313C00146000',\n",
" '/q/op?s=AAPL&strike=150.00',\n",
" '/q?s=AAPL150313C00150000',\n",
" None,\n",
" None,\n",
" '/q/op?s=AAPL&strike=85.00',\n",
" '/q?s=AAPL150313P00085000',\n",
" '/q/op?s=AAPL&strike=90.00',\n",
" '/q?s=AAPL150313P00090000',\n",
" '/q/op?s=AAPL&strike=95.00',\n",
" '/q?s=AAPL150313P00095000',\n",
" '/q/op?s=AAPL&strike=100.00',\n",
" '/q?s=AAPL150313P00100000',\n",
" '/q/op?s=AAPL&strike=101.00',\n",
" '/q?s=AAPL150313P00101000',\n",
" '/q/op?s=AAPL&strike=102.00',\n",
" '/q?s=AAPL150313P00102000',\n",
" '/q/op?s=AAPL&strike=103.00',\n",
" '/q?s=AAPL150313P00103000',\n",
" '/q/op?s=AAPL&strike=104.00',\n",
" '/q?s=AAPL150313P00104000',\n",
" '/q/op?s=AAPL&strike=105.00',\n",
" '/q?s=AAPL150313P00105000',\n",
" '/q/op?s=AAPL&strike=106.00',\n",
" '/q?s=AAPL150313P00106000',\n",
" '/q/op?s=AAPL&strike=107.00',\n",
" '/q?s=AAPL150313P00107000',\n",
" '/q/op?s=AAPL&strike=108.00',\n",
" '/q?s=AAPL150313P00108000',\n",
" '/q/op?s=AAPL&strike=109.00',\n",
" '/q?s=AAPL150313P00109000',\n",
" '/q/op?s=AAPL&strike=110.00',\n",
" '/q?s=AAPL150313P00110000',\n",
" '/q/op?s=AAPL&strike=111.00',\n",
" '/q?s=AAPL150313P00111000',\n",
" '/q/op?s=AAPL&strike=112.00',\n",
" '/q?s=AAPL150313P00112000',\n",
" '/q/op?s=AAPL&strike=113.00',\n",
" '/q?s=AAPL150313P00113000',\n",
" '/q/op?s=AAPL&strike=114.00',\n",
" '/q?s=AAPL150313P00114000',\n",
" '/q/op?s=AAPL&strike=115.00',\n",
" '/q?s=AAPL150313P00115000',\n",
" '/q/op?s=AAPL&strike=116.00',\n",
" '/q?s=AAPL150313P00116000',\n",
" '/q/op?s=AAPL&strike=117.00',\n",
" '/q?s=AAPL150313P00117000',\n",
" '/q/op?s=AAPL&strike=118.00',\n",
" '/q?s=AAPL150313P00118000',\n",
" '/q/op?s=AAPL&strike=119.00',\n",
" '/q?s=AAPL150313P00119000',\n",
" '/q/op?s=AAPL&strike=120.00',\n",
" '/q?s=AAPL150313P00120000',\n",
" '/q/op?s=AAPL&strike=121.00',\n",
" '/q?s=AAPL150313P00121000',\n",
" '/q/op?s=AAPL&strike=122.00',\n",
" '/q?s=AAPL150313P00122000',\n",
" '/q/op?s=AAPL&strike=123.00',\n",
" '/q?s=AAPL150313P00123000',\n",
" '/q/op?s=AAPL&strike=124.00',\n",
" '/q?s=AAPL150313P00124000',\n",
" '/q/op?s=AAPL&strike=125.00',\n",
" '/q?s=AAPL150313P00125000',\n",
" '/q/op?s=AAPL&strike=126.00',\n",
" '/q?s=AAPL150313P00126000',\n",
" '/q/op?s=AAPL&strike=127.00',\n",
" '/q?s=AAPL150313P00127000',\n",
" '/q/op?s=AAPL&strike=128.00',\n",
" '/q?s=AAPL150313P00128000',\n",
" '/q/op?s=AAPL&strike=129.00',\n",
" '/q?s=AAPL150313P00129000',\n",
" '/q/op?s=AAPL&strike=130.00',\n",
" '/q?s=AAPL150313P00130000',\n",
" '/q/op?s=AAPL&strike=131.00',\n",
" '/q?s=AAPL150313P00131000',\n",
" '/q/op?s=AAPL&strike=132.00',\n",
" '/q?s=AAPL150313P00132000',\n",
" '/q/op?s=AAPL&strike=133.00',\n",
" '/q?s=AAPL150313P00133000',\n",
" '/q/op?s=AAPL&strike=134.00',\n",
" '/q?s=AAPL150313P00134000',\n",
" '/q/op?s=AAPL&strike=135.00',\n",
" '/q?s=AAPL150313P00135000',\n",
" '/q/op?s=AAPL&strike=136.00',\n",
" '/q?s=AAPL150313P00136000',\n",
" '/q/op?s=AAPL&strike=137.00',\n",
" '/q?s=AAPL150313P00137000',\n",
" '/q/op?s=AAPL&strike=138.00',\n",
" '/q?s=AAPL150313P00138000',\n",
" '/q/op?s=AAPL&strike=139.00',\n",
" '/q?s=AAPL150313P00139000',\n",
" '/q/op?s=AAPL&strike=140.00',\n",
" '/q?s=AAPL150313P00140000',\n",
" '/q/op?s=AAPL&strike=141.00',\n",
" '/q?s=AAPL150313P00141000',\n",
" '/q/op?s=AAPL&strike=142.00',\n",
" '/q?s=AAPL150313P00142000',\n",
" '/q/op?s=AAPL&strike=143.00',\n",
" '/q?s=AAPL150313P00143000',\n",
" '/q/op?s=AAPL&strike=144.00',\n",
" '/q?s=AAPL150313P00144000',\n",
" '/q/op?s=AAPL&strike=145.00',\n",
" '/q?s=AAPL150313P00145000',\n",
" '/q/op?s=AAPL&strike=150.00',\n",
" '/q?s=AAPL150313P00150000']"
]
}
],
"prompt_number": 127
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
        "Now, finding the right tables in the document can be a matter of trial and error; some websites make it easier by giving a table of interest an id attribute. These were the two tables containing the call data and the put data, respectively."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"tables = doc.findall('.//table')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 128
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"calls = tables[9]\n",
"puts = tables[13]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"ename": "IndexError",
"evalue": "list index out of range",
"output_type": "pyerr",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mIndexError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mcalls\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtables\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m9\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0mputs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtables\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m13\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mIndexError\u001b[0m: list index out of range"
]
}
],
"prompt_number": 129
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"t0 = tables[0]\n",
"t1 = tables[1]\n",
"t2 = tables[2]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 136
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
        "Each table has a header row followed by the data rows."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"rows0 = t0.findall('.//tr')\n",
"rows1 = t1.findall('.//tr')\n",
"rows2 = t2.findall('.//tr')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 139
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
        "For the header as well as the data rows, we want to extract the text from each cell; in the case of the header these are th cells, while for the data rows they are td cells."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def _unpack(row, kind='td'):\n",
" elts = row.findall('.//%s' %kind)\n",
" return [val.text_content() for val in elts]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 140
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"_unpack(rows2[2], kind='td')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 147,
"text": [
"['\\r\\n 85.00\\r\\n ',\n",
" '\\r\\n AAPL150313P00085000\\r\\n ',\n",
" '\\r\\n 0.01\\r\\n ',\n",
" '\\r\\n 0.00\\r\\n ',\n",
" '\\r\\n 0.01\\r\\n ',\n",
" '\\r\\n 0.00\\r\\n ',\n",
" '\\r\\n \\r\\n \\r\\n 0.00%\\r\\n \\r\\n \\r\\n \\r\\n ',\n",
" '\\r\\n 3\\r\\n ',\n",
" '\\r\\n 3\\r\\n ',\n",
" '\\r\\n 237.50%\\r\\n ']"
]
}
],
"prompt_number": 147
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
        "Now it is a matter of combining all these steps to convert this data into a DataFrame. Since the numerical data is still in string format, we want to convert some, but perhaps not all, of the columns to floating point. pandas has a TextParser class, used internally by read_csv and the other parsing functions, that performs the appropriate automatic type conversion:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from pandas.io.parsers import TextParser"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 148
},
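    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As a minimal sketch of what TextParser does, we can feed it a small list of string rows (the toy data and column names here are made up) and let it infer the numeric types:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Hypothetical toy data: TextParser converts the strings to int/float\n",
      "toy = [['1', '2.5'], ['3', '4.0']]\n",
      "TextParser(toy, names=['x', 'y']).get_chunk()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },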
{
"cell_type": "code",
"collapsed": false,
"input": [
      "def parse_options_data(table):\n",
      "    rows = table.findall('.//tr')\n",
      "    header = _unpack(rows[0], kind='th')\n",
      "    data = [_unpack(r) for r in rows[1:]]\n",
      "    return TextParser(data, names=header).get_chunk()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 158
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we invoke this parsing function on the lxml table objects and get DataFrame results:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"call_data = parse_options_data(t2)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 160
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"call_data.head()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" \n",
" \n",
" Strike\n",
" \n",
" \ue004\n",
" \ue002\n",
" \n",
" \n",
" \u2235 Filter\n",
" | \n",
" Contract Name | \n",
" \n",
" \n",
" Last\n",
" \n",
" \ue004\n",
" \ue002\n",
" \n",
" \n",
" | \n",
" \n",
" \n",
" Bid\n",
" \n",
" \ue004\n",
" \ue002\n",
" \n",
" \n",
" | \n",
" \n",
" \n",
" Ask\n",
" \n",
" \ue004\n",
" \ue002\n",
" \n",
" \n",
" | \n",
" \n",
" \n",
" Change\n",
" \n",
" \ue004\n",
" \ue002\n",
" \n",
" \n",
" | \n",
" \n",
" \n",
" %Change\n",
" \n",
" \ue004\n",
" \ue002\n",
" \n",
" \n",
" | \n",
" \n",
" \n",
" Volume\n",
" \n",
" \ue004\n",
" \ue002\n",
" \n",
" \n",
" | \n",
" \n",
" \n",
" Open Interest\n",
" \n",
" \ue004\n",
" \ue002\n",
" \n",
" \n",
" | \n",
" \n",
" \n",
" Implied Volatility\n",
" \n",
" \ue004\n",
" \ue002\n",
" \n",
" \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" \\r\\n \\r\\n ... | \n",
" None | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" None | \n",
" NaN | \n",
" NaN | \n",
" None | \n",
"
\n",
" \n",
" 1 | \n",
" \\r\\n 85.00\\r\\n | \n",
" \\r\\n AAPL150313P00085000\\r\\n | \n",
" 0.01 | \n",
" 0 | \n",
" 0.01 | \n",
" 0 | \n",
" \\r\\n \\r\\n \\r\\n ... | \n",
" 3 | \n",
" 3 | \n",
" \\r\\n 237.50%\\r\\n | \n",
"
\n",
" \n",
" 2 | \n",
" \\r\\n 90.00\\r\\n | \n",
" \\r\\n AAPL150313P00090000\\r\\n | \n",
" 0.01 | \n",
" 0 | \n",
" 0.01 | \n",
" 0 | \n",
" \\r\\n \\r\\n \\r\\n ... | \n",
" 2 | \n",
" 2 | \n",
" \\r\\n 206.25%\\r\\n | \n",
"
\n",
" \n",
" 3 | \n",
" \\r\\n 95.00\\r\\n | \n",
" \\r\\n AAPL150313P00095000\\r\\n | \n",
" 0.01 | \n",
" 0 | \n",
" 0.01 | \n",
" 0 | \n",
" \\r\\n \\r\\n \\r\\n ... | \n",
" 10 | \n",
" 6061 | \n",
" \\r\\n 168.75%\\r\\n | \n",
"
\n",
" \n",
" 4 | \n",
" \\r\\n 100.00\\r\\n | \n",
" \\r\\n AAPL150313P00100000\\r\\n | \n",
" 0.01 | \n",
" 0 | \n",
" 0.01 | \n",
" 0 | \n",
" \\r\\n \\r\\n \\r\\n ... | \n",
" 2 | \n",
" 674 | \n",
" \\r\\n 140.63%\\r\\n | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 161,
"text": [
" \\n \\n Strike\\n \\n \ue004\\n \ue002\\n \\n \\n \u2235 Filter\\n \\\n",
"0 \\r\\n \\r\\n ... \n",
"1 \\r\\n 85.00\\r\\n \n",
"2 \\r\\n 90.00\\r\\n \n",
"3 \\r\\n 95.00\\r\\n \n",
"4 \\r\\n 100.00\\r\\n \n",
"\n",
" Contract Name \\\n",
"0 None \n",
"1 \\r\\n AAPL150313P00085000\\r\\n \n",
"2 \\r\\n AAPL150313P00090000\\r\\n \n",
"3 \\r\\n AAPL150313P00095000\\r\\n \n",
"4 \\r\\n AAPL150313P00100000\\r\\n \n",
"\n",
" \\n \\n Last\\n \\n \ue004\\n \ue002\\n \\n \\n \\\n",
"0 NaN \n",
"1 0.01 \n",
"2 0.01 \n",
"3 0.01 \n",
"4 0.01 \n",
"\n",
" \\n \\n Bid\\n \\n \ue004\\n \ue002\\n \\n \\n \\\n",
"0 NaN \n",
"1 0 \n",
"2 0 \n",
"3 0 \n",
"4 0 \n",
"\n",
" \\n \\n Ask\\n \\n \ue004\\n \ue002\\n \\n \\n \\\n",
"0 NaN \n",
"1 0.01 \n",
"2 0.01 \n",
"3 0.01 \n",
"4 0.01 \n",
"\n",
" \\n \\n Change\\n \\n \ue004\\n \ue002\\n \\n \\n \\\n",
"0 NaN \n",
"1 0 \n",
"2 0 \n",
"3 0 \n",
"4 0 \n",
"\n",
" \\n \\n %Change\\n \\n \ue004\\n \ue002\\n \\n \\n \\\n",
"0 None \n",
"1 \\r\\n \\r\\n \\r\\n ... \n",
"2 \\r\\n \\r\\n \\r\\n ... \n",
"3 \\r\\n \\r\\n \\r\\n ... \n",
"4 \\r\\n \\r\\n \\r\\n ... \n",
"\n",
" \\n \\n Volume\\n \\n \ue004\\n \ue002\\n \\n \\n \\\n",
"0 NaN \n",
"1 3 \n",
"2 2 \n",
"3 10 \n",
"4 2 \n",
"\n",
" \\n \\n Open Interest\\n \\n \ue004\\n \ue002\\n \\n \\n \\\n",
"0 NaN \n",
"1 3 \n",
"2 2 \n",
"3 6061 \n",
"4 674 \n",
"\n",
" \\n \\n Implied Volatility\\n \\n \ue004\\n \ue002\\n \\n \\n \n",
"0 None \n",
"1 \\r\\n 237.50%\\r\\n \n",
"2 \\r\\n 206.25%\\r\\n \n",
"3 \\r\\n 168.75%\\r\\n \n",
"4 \\r\\n 140.63%\\r\\n "
]
}
],
"prompt_number": 161
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
      "Parsing XML with lxml.objectify (note: the MTA data files were not available)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"XML (extensible markup language) is another common structured data format supporting hierarchical, nested data with metadata."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
      "The New York MTA publishes a number of data series about its bus and train services. Here we look at the performance data, which is contained in a set of XML files. Each train or bus service has a separate file containing monthly data as a series of XML records."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
      "Using lxml.objectify, we parse the file and get a reference to the root node of the XML file with getroot:"
]
},
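    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Since the MTA data files are not available here, the following is a small self-contained sketch on an inline XML string; the tag names (AGENCY_NAME, PERIOD_MONTH, MONTHLY_ACTUAL) are illustrative, modeled loosely on the MTA performance files:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from io import BytesIO\n",
      "from lxml import objectify\n",
      "\n",
      "# Illustrative record; real MTA files contain many such INDICATOR records\n",
      "xml = \"\"\"<INDICATOR>\n",
      "  <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>\n",
      "  <PERIOD_MONTH>12</PERIOD_MONTH>\n",
      "  <MONTHLY_ACTUAL>96.9</MONTHLY_ACTUAL>\n",
      "</INDICATOR>\"\"\"\n",
      "\n",
      "parsed = objectify.parse(BytesIO(xml.encode('utf-8')))\n",
      "root = parsed.getroot()\n",
      "# objectify converts leaf values to Python types on access\n",
      "str(root.AGENCY_NAME), int(root.PERIOD_MONTH), float(root.MONTHLY_ACTUAL)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },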
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
      "Binary Data Formats"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
      "One of the easiest ways to store data efficiently in binary format is using Python's built-in pickle serialization. pandas objects all have a to_pickle method that writes the data to disk as a pickle:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!pwd"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"/Users/sergulaydore/Documents/Face_Car/python_scripts/Python_for_Data_Analysis\r\n"
]
}
],
"prompt_number": 162
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%cd /Users/sergulaydore/pydata-book/ch06"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"/Users/sergulaydore/pydata-book/ch06\n"
]
}
],
"prompt_number": 166
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"frame = pd.read_csv('./ex1.csv')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 169
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"frame"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
" c | \n",
" d | \n",
" message | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
" hello | \n",
"
\n",
" \n",
" 1 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
" 8 | \n",
" world | \n",
"
\n",
" \n",
" 2 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
" 12 | \n",
" foo | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 170,
"text": [
" a b c d message\n",
"0 1 2 3 4 hello\n",
"1 5 6 7 8 world\n",
"2 9 10 11 12 foo"
]
}
],
"prompt_number": 170
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"frame.to_pickle('./frame_pickle')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 172
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"pd.read_pickle('./frame_pickle')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
" c | \n",
" d | \n",
" message | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
" hello | \n",
"
\n",
" \n",
" 1 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
" 8 | \n",
" world | \n",
"
\n",
" \n",
" 2 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
" 12 | \n",
" foo | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 174,
"text": [
" a b c d message\n",
"0 1 2 3 4 hello\n",
"1 5 6 7 8 world\n",
"2 9 10 11 12 foo"
]
}
],
"prompt_number": 174
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: pickle is only recommended as a short-term storage format. The problem is that it is hard to guarantee that the format will be stable over time; an object pickled today may not unpickle with a later version of a library."
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Using HDF5 format"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
      "There are a number of tools that facilitate efficiently reading and writing large amounts of scientific data in binary format on disk. A popular industry-grade library for this is HDF5, a C library with interfaces in many other languages such as Java, Python, and MATLAB. The \"HDF\" in HDF5 stands for hierarchical data format. Each HDF5 file contains an internal file-system-like node structure, enabling you to store multiple datasets and supporting metadata. Compared with simpler formats, HDF5 supports on-the-fly compression with a variety of compressors, enabling data with repeated patterns to be stored more efficiently. For very large datasets that don't fit into memory, HDF5 is a good choice, as you can efficiently read and write small sections of much larger arrays."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
      "There are not one but two interfaces to the HDF5 library in Python, PyTables and h5py, each of which takes a different approach to the problem. h5py provides a direct, but high-level, interface to the HDF5 API, while PyTables abstracts many of the details of HDF5 to provide multiple flexible data containers, table indexing, querying capability, and some support for out-of-core computations."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"pandas has a minimal dict-like HDFStore class, which uses PyTables to store pandas objects:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"store = pd.HDFStore('./mydata.h5')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 175
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"store['obj1'] = frame"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 177
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"store['obj1_col'] = frame['a']"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 178
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"store"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 179,
"text": [
"\n",
"File path: ./mydata.h5\n",
"/obj1 frame (shape->[3,5])\n",
"/obj1_col series (shape->[3]) "
]
}
],
"prompt_number": 179
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Objects contained in the HDF5 file can be retrieved in a dict-like fashion:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"store['obj1']"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
" c | \n",
" d | \n",
" message | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
" hello | \n",
"
\n",
" \n",
" 1 | \n",
" 5 | \n",
" 6 | \n",
" 7 | \n",
" 8 | \n",
" world | \n",
"
\n",
" \n",
" 2 | \n",
" 9 | \n",
" 10 | \n",
" 11 | \n",
" 12 | \n",
" foo | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 180,
"text": [
" a b c d message\n",
"0 1 2 3 4 hello\n",
"1 5 6 7 8 world\n",
"2 9 10 11 12 foo"
]
}
],
"prompt_number": 180
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Reading Microsoft Excel Files"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"xls_file = pd.ExcelFile('./data.xls')\n",
"table = xls_file.parse('Sheet1')"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
      "Interacting with HTML and Web APIs (note: the Twitter API now requires authentication)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
      "Many websites have public APIs providing data feeds via JSON or some other format. There are a number of ways to access these APIs from Python; one easy-to-use method is the requests package. To search for the words \"python pandas\" on Twitter, we can make an HTTP GET request like so:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import requests"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 182
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"url = 'https://api.twitter.com/1.1/search/tweets.json?q=%23freebandnames&since_id=24012619984051000&max_id=250126199840518145&result_type=mixed&count=4'"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 215
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"resp = requests.get(url)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 216
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"resp"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 217,
"text": [
""
]
}
],
"prompt_number": 217
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The response object's text attribute contains the content of the GET query. Many web APIs will return a JSON string that must be loaded into a Python object:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import json"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 218
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"data = json.loads(resp.text)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 219
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"data.keys()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 220,
"text": [
"[u'errors']"
]
}
],
"prompt_number": 220
},
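    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "With authentication, a successful response would contain a list of tweet records instead of the error payload above. As a sketch, here is how such a JSON payload can be loaded into a DataFrame; the raw string and its field names are hypothetical, modeled on the Twitter search API:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import json\n",
      "\n",
      "# Hypothetical payload of the kind a successful search would return\n",
      "raw = '{\"statuses\": [{\"created_at\": \"Mon Mar 09\", \"text\": \"python pandas\", \"user\": {\"screen_name\": \"pydata\"}}]}'\n",
      "payload = json.loads(raw)\n",
      "tweets = pd.DataFrame(payload['statuses'], columns=['created_at', 'text'])\n",
      "tweets"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },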
{
"cell_type": "code",
"collapsed": false,
"input": [
"import requests\n",
"from requests_oauthlib import OAuth1"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 224
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Interacting with DataBases"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
      "In many applications data rarely comes from text files, that being a fairly inefficient way to store large amounts of data. SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use, and many alternative non-SQL databases have become quite popular. The choice of database is usually dependent on the performance, data integrity, and scalability needs of an application.\n",
"\n",
"Loading data from SQL into a DataFrame is fairly straightforward."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import sqlite3"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 225
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"query = \"\"\"\n",
"CREATE TABLE test\n",
"(a VARCHAR(20), b VARCHAR(20),\n",
"c REAL, d INTEGER\n",
"); \"\"\""
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 226
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"con = sqlite3.connect(':memory:')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 227
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"con.execute(query)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 228,
"text": [
""
]
}
],
"prompt_number": 228
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"con.commit()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 229
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, insert a few rows of data:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"data = [ ('Atlanta', 'Georgia', 1.25, 6),\n",
" ('Tallahassee', 'Florida', 2.6, 3),\n",
" ('Sacramento','California', 1.7, 5)\n",
" ]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 230
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"stmt = \"INSERT INTO test VALUES(?,?,?,?)\""
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 231
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"con.executemany(stmt, data)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 232,
"text": [
""
]
}
],
"prompt_number": 232
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"con.commit()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 233
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Most Python SQL drivers return a list of tuples when selecting data from a table:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"cursor = con.execute('select * from test')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 234
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"rows = cursor.fetchall()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 235
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"rows"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 236,
"text": [
"[(u'Atlanta', u'Georgia', 1.25, 6),\n",
" (u'Tallahassee', u'Florida', 2.6, 3),\n",
" (u'Sacramento', u'California', 1.7, 5)]"
]
}
],
"prompt_number": 236
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
      "You can pass the list of tuples to the DataFrame constructor, but you also need the column names, contained in the cursor's description attribute:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"cursor.description"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 237,
"text": [
"(('a', None, None, None, None, None, None),\n",
" ('b', None, None, None, None, None, None),\n",
" ('c', None, None, None, None, None, None),\n",
" ('d', None, None, None, None, None, None))"
]
}
],
"prompt_number": 237
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"pd.DataFrame(rows, columns = zip(*cursor.description)[0])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
" c | \n",
" d | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Atlanta | \n",
" Georgia | \n",
" 1.25 | \n",
" 6 | \n",
"
\n",
" \n",
" 1 | \n",
" Tallahassee | \n",
" Florida | \n",
" 2.60 | \n",
" 3 | \n",
"
\n",
" \n",
" 2 | \n",
" Sacramento | \n",
" California | \n",
" 1.70 | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 238,
"text": [
" a b c d\n",
"0 Atlanta Georgia 1.25 6\n",
"1 Tallahassee Florida 2.60 3\n",
"2 Sacramento California 1.70 5"
]
}
],
"prompt_number": 238
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
      "This is quite a bit of munging that you'd rather not repeat each time you query the database. pandas has a read_sql function in its pandas.io.sql module that simplifies the process. Just pass the select statement and the connection object:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas.io.sql as sql"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 239
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"sql.read_sql('select * from test', con)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
" c | \n",
" d | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Atlanta | \n",
" Georgia | \n",
" 1.25 | \n",
" 6 | \n",
"
\n",
" \n",
" 1 | \n",
" Tallahassee | \n",
" Florida | \n",
" 2.60 | \n",
" 3 | \n",
"
\n",
" \n",
" 2 | \n",
" Sacramento | \n",
" California | \n",
" 1.70 | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 241,
"text": [
" a b c d\n",
"0 Atlanta Georgia 1.25 6\n",
"1 Tallahassee Florida 2.60 3\n",
"2 Sacramento California 1.70 5"
]
}
],
"prompt_number": 241
}
],
"metadata": {}
}
]
}