{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction to HDF5 (vs. a database system)\n", "- a **database** efficiently handles *large numbers of transactions consisting of small pieces of data*; **HDF5** handles one or only *a few transactions consisting of large amounts of data*. It is important to identify the right tool for the right task. [source](http://dpservis.wordpress.com/tag/hdf5/)\n", "- **Pros** and **Cons** of HDF5\n", " - PROS\n", " 1. HDF5 operates on single local files that can be transferred using OS commands.\n", " 2. HDF5 provides the means to arrange information in a hierarchical tree and access it randomly, independently of the application that generated the information.\n", " 3. HDF5 provides a dynamic mapping paradigm.\n", " 4. Performance-wise, it is often faster than native calls.\n", " - CONS\n", " 1. It is an open, self-describing format, which makes the data transparent and completely open. This can be a drawback for proprietary applications; in fact it is one of the major ones.\n", " 2. There is no simultaneous access support at the dataset level: it is not possible to write to the same dataset from multiple clients and read from multiple clients at the same time.\n", " 3. Locking/unlocking, journaling, and composing transactions are either not implemented yet or left entirely to the application." 
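, "\n", "The hierarchical-tree and random-access points above can be sketched with pandas' HDFStore (a minimal sketch, assuming PyTables is installed; the file path and key names here are arbitrary examples):\n", "\n", "```python\n", "import pandas as pd\n", "store = pd.HDFStore('data/intro.h5', 'w')\n", "store['raw/prices'] = pd.Series([1.0, 2.0])    # keys form a tree of groups\n", "store['clean/prices'] = pd.Series([1.5, 2.5])\n", "print store.keys()    # random access by key, independent of the writing application\n", "store.close()\n", "```"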
] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "from StringIO import StringIO\n", "from IPython.core.display import display\n", "%pylab inline" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Populating the interactive namespace from numpy and matplotlib\n" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reader and Writer\n", "- The pandas readers are a set of read_xxx() functions that generally return a pandas object.\n", "- The corresponding writer functions are object methods that are accessed like df.to_xxx()\n", "- Main readers and writers:\n", " - read_csv, to_csv (and read_table)\n", " - read_json, to_json\n", " - read_html, to_html\n", " - read_pickle, to_pickle\n", " - read_clipboard, to_clipboard\n", " - read_excel, to_excel\n", " - read_hdf, to_hdf\n", " - read_sql, to_sql\n", " - read_stata, to_stata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## CSV and Text Files" ] }, { "cell_type": "code", "collapsed": false, "input": [ "## Specifying index column\n", "foo_csv = StringIO(\"\"\"date,A,B,C\n", "20090101,a,1,2\n", "20090102,b,3,4\n", "20090103,c,4,5\n", "\"\"\")\n", "print '>> Without specifying index column'\n", "foo_csv.seek(0)\n", "display( pd.read_csv(foo_csv))\n", "print '>> index column = 0, COLUMN INDEXING starts with 0'\n", "foo_csv.seek(0)\n", "display( pd.read_csv(foo_csv, index_col=0))\n", "print '>> index column = date'\n", "foo_csv.seek(0)\n", "display( pd.read_csv(foo_csv, index_col='date'))\n", "print '>> parsing index column date'\n", "foo_csv.seek(0)\n", "display( pd.read_csv(foo_csv, index_col='date', parse_dates='date'))\n", "print \">> hierarchical index = [0, 'A']\"\n", "foo_csv.seek(0)\n", "display( pd.read_csv(foo_csv, index_col = [0, 'A'], \n", " parse_dates = 0))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": 
"stdout", "text": [ ">> Without specifying index column\n" ] }, { "html": [ "
" ], "metadata": {}, "output_type": "display_data", "text": [ " date A B C\n", "0 20090101 a 1 2\n", "1 20090102 b 3 4\n", "2 20090103 c 4 5" ] }, { "output_type": "stream", "stream": "stdout", "text": [ ">> index column = 0, COLUMN INDEXING starts with 0\n" ] }, { "html": [ "
" ], "metadata": {}, "output_type": "display_data", "text": [ " A B C\n", "date \n", "20090101 a 1 2\n", "20090102 b 3 4\n", "20090103 c 4 5" ] }, { "output_type": "stream", "stream": "stdout", "text": [ ">> index column = date\n" ] }, { "html": [ "
" ], "metadata": {}, "output_type": "display_data", "text": [ " A B C\n", "date \n", "20090101 a 1 2\n", "20090102 b 3 4\n", "20090103 c 4 5" ] }, { "output_type": "stream", "stream": "stdout", "text": [ ">> parsing index column date\n" ] }, { "html": [ "
" ], "metadata": {}, "output_type": "display_data", "text": [ " A B C\n", "date \n", "2009-01-01 a 1 2\n", "2009-01-02 b 3 4\n", "2009-01-03 c 4 5" ] }, { "output_type": "stream", "stream": "stdout", "text": [ ">> hierarchical index = [0, 'A']\n" ] }, { "html": [ "
" ], "metadata": {}, "output_type": "display_data", "text": [ " B C\n", "date A \n", "2009-01-01 a 1 2\n", "2009-01-02 b 3 4\n", "2009-01-03 c 4 5" ] } ], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "## Speaking different CSV dialects\n", "## Use the dialect keyword to specify the file format; by default\n", "## it uses the Excel dialect, but you can specify either a\n", "## dialect name or a csv.Dialect instance\n", "data = \"\"\"label1,label2,label3\n", "index1,\"a,c,e\n", "index2,b,d,f\n", "\"\"\"\n", "## By default, read_csv uses the excel dialect and treats \n", "## the double quote \"\" as the quote character, which causes\n", "## it to fail when it finds a newline before it finds the \n", "## closing double quote.\n", "try:\n", " pd.read_csv(StringIO(data))\n", "except:\n", " print 'CParserError raised'\n", " \n", "## we can get around this by using dialect\n", "import csv\n", "dia = csv.excel()\n", "dia.quoting = csv.QUOTE_NONE\n", "display(pd.read_csv(StringIO(data), dialect = dia))\n", "\n", "## equivalently\n", "display(pd.read_csv(StringIO(data), quoting=csv.QUOTE_NONE))\n", "\n", "## THINGS TO NOTE: all of the dialect options can be specified\n", "## separately as keyword arguments, including\n", "## *thousands* (specifies the thousands separator), \n", "## *lineterminator* (str of len 1, char to break files into lines)\n", "## *quotechar* (str, the char to denote start/end of a quoted item), \n", "## Quoted items can include the delimiter and it will be ignored (main purpose of quoting)\n", "## *quoting* (int, csv.QUOTE_MINIMAL, csv.QUOTE_ALL, csv.QUOTE_NONE, csv.QUOTE_NONNUMERIC)\n", "## *skipinitialspace* (default=False, ignore spaces after delimiter)\n", "## *escapechar* (str specifying how to escape quoted data)\n", "## *comment* ()\n", "## *encoding* (e.g., 'utf-8' or 'latin-1')\n", "## LOOK UP the csv.Dialect documentation for details\n", "## e.g.\n", "data = 'a, b, c~1, 2, 3~4, 5, 6' # extra whitespace, '~' as line terminator\n", 
"display(pd.read_csv(StringIO(data), lineterminator='~', \n", " skipinitialspace=True))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "CParserError raised\n" ] }, { "html": [ "
" ], "metadata": {}, "output_type": "display_data", "text": [ " label1 label2 label3\n", "index1 \"a c e\n", "index2 b d f" ] }, { "html": [ "
" ], "metadata": {}, "output_type": "display_data", "text": [ " label1 label2 label3\n", "index1 \"a c e\n", "index2 b d f" ] }, { "html": [ "
" ], "metadata": {}, "output_type": "display_data", "text": [ " a b c\n", "0 1 2 3\n", "1 4 5 6" ] } ], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "## Specifying column data types\n", "## You can indicate the data type for the whole df\n", "## or individual columns\n", "data = 'a,b,c\\n1,2,3\\n4,5,6\\n7,8,9'\n", "print pd.read_csv(StringIO(data)).dtypes\n", "print pd.read_csv(StringIO(data), dtype='object').dtypes\n", "print pd.read_csv(StringIO(data), \n", " dtype = {'a':'str', 'b':'float'}).dtypes\n", "## strings are objects in numpy" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "a int64\n", "b int64\n", "c int64\n", "dtype: object\n", "a object\n", "b object\n", "c object\n", "dtype: object\n", "a object\n", "b float64\n", "c int64\n", "dtype: object\n" ] } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "## Overwrite header names by specifying 'names' param\n", "## you can choose to keep or throw away original headers\n", "## by specifying 'header' param\n", "data = \"\"\"a,b,c\n", "1,2,3\n", "4,5,6\n", "7,8,9\n", "\"\"\"\n", "display(pd.read_csv(StringIO(data)))\n", "display(pd.read_csv(StringIO(data), names = list(\"ABC\"), header = None))\n", "display(pd.read_csv(StringIO(data), names = list(\"ABC\"), header = 0))" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "display_data", "text": [ " a b c\n", "0 1 2 3\n", "1 4 5 6\n", "2 7 8 9" ] }, { "html": [ "
" ], "metadata": {}, "output_type": "display_data", "text": [ " A B C\n", "0 a b c\n", "1 1 2 3\n", "2 4 5 6\n", "3 7 8 9" ] }, { "html": [ "
" ], "metadata": {}, "output_type": "display_data", "text": [ " A B C\n", "0 1 2 3\n", "1 4 5 6\n", "2 7 8 9" ] } ], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "## Filtering columns (usecols)\n", "## the 'usecols' param allows you to select any subset of the\n", "## columns in a file, either the column names or position\n", "## Based on doc, it usually results in much faster parsing time and lower memory usage.\n", "data = 'a,b,c,d\\n1,2,3,foo\\n4,5,6,bar\\n7,8,9,baz'\n", "display(pd.read_csv(StringIO(data)))\n", "display(pd.read_csv(StringIO(data), usecols=['b', 'd']))" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "display_data", "text": [ " a b c d\n", "0 1 2 3 foo\n", "1 4 5 6 bar\n", "2 7 8 9 baz" ] }, { "html": [ "
" ], "metadata": {}, "output_type": "display_data", "text": [ " b d\n", "0 2 foo\n", "1 5 bar\n", "2 8 baz" ] } ], "prompt_number": 8 }, { "cell_type": "code", "collapsed": false, "input": [ "## Dealing with Unicode data\n", "## the encoding arg should be used for encoded unicode data\n", "## which will result in byte strings being decoded to unicode\n", "## in the result\n", "data = 'word,length\\nTr\\xe4umen,7\\nGr\\xfc\\xdfe,5'\n", "df = pd.read_csv(StringIO(data), encoding='latin-1')\n", "df" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 9, "text": [ " word length\n", "0 Tr\u00e4umen 7\n", "1 Gr\u00fc\u00dfe 5" ] } ], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "## Specifying date columns\n", "## the \"parse_dates\" and \"date_parser\" params allow \n", "## users to specify a variety of cols and date/time formats\n", "tmp_data = \"\"\"KORD,19990127, 19:00:00, 18:56:00, 0.8100\n", "KORD,19990127, 20:00:00, 19:56:00, 0.0100\n", "KORD,19990127, 21:00:00, 20:56:00, -0.5900\n", "KORD,19990127, 21:00:00, 21:18:00, -0.9900\n", "KORD,19990127, 22:00:00, 21:56:00, -0.5900\n", "KORD,19990127, 23:00:00, 22:56:00, -0.5900\n", "\"\"\"\n", "\n", "## COMBINE DIFFERENT COLS to parse date\n", "pd.read_csv(StringIO(tmp_data), header = None, \n", " parse_dates={'end': [1, 2], 'start': [1, 3]},)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 10, "text": [ " start end 0 4\n", "0 1999-01-27 18:56:00 1999-01-27 19:00:00 KORD 0.81\n", "1 1999-01-27 19:56:00 1999-01-27 20:00:00 KORD 0.01\n", "2 1999-01-27 20:56:00 1999-01-27 21:00:00 KORD -0.59\n", "3 1999-01-27 21:18:00 1999-01-27 21:00:00 KORD -0.99\n", "4 1999-01-27 21:56:00 1999-01-27 22:00:00 KORD -0.59\n", "5 1999-01-27 22:56:00 1999-01-27 23:00:00 KORD -0.59" ] } ], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [ "## Handling Boolean Values\n", "## use true_values and false_values to specify true/false values\n", "data = 'a,b,c\\n1,Yes,2\\n3,No,4'\n", "df1 = pd.read_csv(StringIO(data))\n", "display(df1)\n", "print df1.dtypes\n", "df2 = pd.read_csv(StringIO(data), true_values=['Yes'], false_values=['No'])\n", "display(df2)\n", "print df2.dtypes" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "display_data", "text": [ " a b c\n", "0 1 Yes 2\n", "1 3 No 4" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "a int64\n", "b object\n", "c int64\n", "dtype: object\n" ] }, { "html": [ "
" ], "metadata": {}, "output_type": "display_data", "text": [ " a b c\n", "0 1 True 2\n", "1 3 False 4" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "a int64\n", "b bool\n", "c int64\n", "dtype: object\n" ] } ], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [ "## AUTO skipping bad lines\n", "data = 'a,b,c\\n1,2,3\\n4,5,6,7\\n8,9,10'\n", "try:\n", " pd.read_csv(StringIO(data))\n", "except Exception, ex:\n", " print ex\n", "\n", "display(pd.read_csv(StringIO(data), error_bad_lines=False))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Error tokenizing data. C error: Expected 3 fields in line 3, saw 4\n", "\n" ] }, { "output_type": "stream", "stream": "stderr", "text": [ "Skipping line 3: expected 3 fields, saw 4\n", "\n" ] }, { "html": [ "
" ], "metadata": {}, "output_type": "display_data", "text": [ " a b c\n", "0 1 2 3\n", "1 8 9 10" ] } ], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "## Quoting and Escaping characters\n", "## Quotes (and other escape characters) in embedded fields can be handled in any number of ways. \n", "## One way is to use backslashes; \n", "## to properly parse this data, you should pass the escapechar option\n", "data = 'a,b\\n\"hello, \\\\\"Bob\\\\\", nice to see you\",5'\n", "pd.read_csv(StringIO(data), escapechar='\\\\')" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 13, "text": [ " a b\n", "0 hello, \"Bob\", nice to see you 5" ] } ], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "## ITERATING through files chunk by chunk\n", "## Suppose you wish to iterate through a (potentially very large) file \n", "## LAZILY rather than reading the entire file into memory, \n", "## such as the following:\n", "\n", "## 1. BY SPECIFYING \"chunksize\" to read_csv() ror read_table(),\n", "## THE RETURN VALUE WILL BE AN ITERABLE OBJECT OF TYPE\n", "## \"TextFileReader\"\n", "\n", "tmp_sv = \"\"\"|0|1|2|3\n", "0|0.4691122999071863|-0.2828633443286633|-1.5090585031735124|-1.1356323710171934\n", "1|1.2121120250208506|-0.17321464905330858|0.11920871129693428|-1.0442359662799567\n", "2|-0.8618489633477999|-2.1045692188948086|-0.4949292740687813|1.071803807037338\n", "3|0.7215551622443669|-0.7067711336300845|-1.0395749851146963|0.27185988554282986\n", "4|-0.42497232978883753|0.567020349793672|0.27623201927771873|-1.0874006912859915\n", "5|-0.6736897080883706|0.1136484096888855|-1.4784265524372235|0.5249876671147047\n", "6|0.4047052186802365|0.5770459859204836|-1.7150020161146375|-1.0392684835147725\n", "7|-0.3706468582364464|-1.1578922506419993|-1.344311812731667|0.8448851414248841\n", "8|1.0757697837155533|-0.10904997528022223|1.6435630703622064|-1.4693879595399115\n", "9|0.35702056413309086|-0.6746001037299882|-1.776903716971867|-0.9689138124473498\n", "\"\"\"\n", "reader = pd.read_table(StringIO(tmp_sv), sep = '|', chunksize=4)\n", "print type(reader)\n", "for chunk in reader:\n", " print type(chunk)\n", " print chunk" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\n", "\n", " Unnamed: 0 0 1 2 3\n", "0 0 0.469112 -0.282863 -1.509059 -1.135632\n", "1 1 1.212112 -0.173215 0.119209 -1.044236\n", "2 2 -0.861849 -2.104569 -0.494929 1.071804\n", "3 3 0.721555 -0.706771 -1.039575 0.271860\n", "\n", 
" Unnamed: 0 0 1 2 3\n", "0 4 -0.424972 0.567020 0.276232 -1.087401\n", "1 5 -0.673690 0.113648 -1.478427 0.524988\n", "2 6 0.404705 0.577046 -1.715002 -1.039268\n", "3 7 -0.370647 -1.157892 -1.344312 0.844885\n", "\n", " Unnamed: 0 0 1 2 3\n", "0 8 1.075770 -0.10905 1.643563 -1.469388\n", "1 9 0.357021 -0.67460 -1.776904 -0.968914\n" ] } ], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "## 2. SPECIFYING iterator = True WILL ALSO RETURN THE\n", "## TextFileReader object\n", "reader = pd.read_table(StringIO(tmp_sv), sep=\"|\", iterator=True, )\n", "chunk = reader.get_chunk(5)\n", "print type(chunk)\n", "print chunk" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\n", " Unnamed: 0 0 1 2 3\n", "0 0 0.469112 -0.282863 -1.509059 -1.135632\n", "1 1 1.212112 -0.173215 0.119209 -1.044236\n", "2 2 -0.861849 -2.104569 -0.494929 1.071804\n", "3 3 0.721555 -0.706771 -1.039575 0.271860\n", "4 4 -0.424972 0.567020 0.276232 -1.087401\n" ] } ], "prompt_number": 15 }, { "cell_type": "code", "collapsed": false, "input": [ "## WRITING to CSV\n", "## to_csv() takes two BOOL parameters, 'header' and 'index', specifying \n", "## whether to write header and index information to the file" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## JSON and HTML\n", "- Important params: \n", " - orient: \n", " 1. For Series (default = index): {split, records, index}\n", " 2. For DataFrame (default = columns): {split, records, index, columns, values}\n", "\n", "The format of the JSON string\n", "\n", "*split*: dict like {index -> [index], columns -> [columns], data -> [values]}\n", "\n", "*records*: list like [{column -> value}, ... 
, {column -> value}]\n", "\n", "*index*: dict like {index -> {column -> value}}\n", "\n", "*columns*: dict like {column -> {index -> value}}\n", "\n", "*values*: just the values array" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from IPython.core.display import HTML\n", "HTML (df.to_html())" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 16, "text": [ "" ] } ], "prompt_number": 16 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pickling and Serialization\n", "- All pandas objects are equipped with to_pickle() methods using python's cPickle module\n", "- The read_pickle() function can be used to load any pandas object from file\n", "- The previous save() and load() methods are now deprecated" ] }, { "cell_type": "code", "collapsed": false, "input": [ "df.to_pickle('data/foo.pkl')\n", "!ls data/foo*" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "data/foo.pkl\r\n" ] } ], "prompt_number": 17 }, { "cell_type": "code", "collapsed": false, "input": [ "dff = pd.read_pickle('data/foo.pkl')\n", "dff" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 18, "text": [ " word length\n", "0 Tr\u00e4umen 7\n", "1 Gr\u00fc\u00dfe 5" ] } ], "prompt_number": 18 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Excel File" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## HDF5 (PyTables)\n", "- HDFStore is a **dict-like** object which reads and writes pandas objects in the high-performance HDF5 format via the excellent PyTables library.\n", "- Objects can be written to the file just like adding key-value pairs to a dict.\n", "- Retrieve objects by specifying the key (object name)\n", "- Use pd.get_store(\"filename\") to work with an HDF5 store in a managed context" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### HDF5 Introduction" ] }, { "cell_type": "code", "collapsed": false, "input": [ "## create or read a store \n", "import pandas as pd\n", "from numpy.random import randn\n", "store = pd.HDFStore('data/store.h5', 'w')\n", "print store\n", "## adding some objects\n", "index = pd.date_range('1/1/2000', periods=8, freq='D')\n", "s = pd.Series(randn(5), index = list(\"abcde\"))\n", "df = pd.DataFrame(randn(8, 3), index = index, columns = list(\"ABC\"))\n", "store['s'] = s\n", "store['df'] = df\n", "print store\n", "## read object\n", "print store['df']\n", "store.close()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\n", "File path: data/store.h5\n", "Empty\n", "\n", "File path: data/store.h5\n", "/df frame (shape->[8,3])\n", "/s series (shape->[5]) \n", " A B C\n", "2000-01-01 -0.855040 0.713266 0.053736\n", "2000-01-02 1.434090 1.091144 1.169743\n", "2000-01-03 -0.650549 -1.112457 -1.901135\n", "2000-01-04 0.047931 0.809756 -0.221455\n", "2000-01-05 0.536011 -0.320523 -0.321210\n", "2000-01-06 -0.326509 -2.570665 -0.559427\n", "2000-01-07 1.207612 1.922518 -0.475120\n", "2000-01-08 -0.484949 1.387578 1.840412\n" ] } ], "prompt_number": 19 }, { "cell_type": 
"code", "collapsed": false, "input": [ "## read back objects as from a dict\n", "store = pd.HDFStore('data/store.h5', 'r')\n", "print store.df" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " A B C\n", "2000-01-01 -0.855040 0.713266 0.053736\n", "2000-01-02 1.434090 1.091144 1.169743\n", "2000-01-03 -0.650549 -1.112457 -1.901135\n", "2000-01-04 0.047931 0.809756 -0.221455\n", "2000-01-05 0.536011 -0.320523 -0.321210\n", "2000-01-06 -0.326509 -2.570665 -0.559427\n", "2000-01-07 1.207612 1.922518 -0.475120\n", "2000-01-08 -0.484949 1.387578 1.840412\n" ] } ], "prompt_number": 20 }, { "cell_type": "code", "collapsed": false, "input": [ "## delete an object as from a dict\n", "store.close()\n", "store = pd.HDFStore('data/store.h5', 'a') # reopen in append ('a') mode\n", "del store['s']\n", "store.close()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 21 }, { "cell_type": "code", "collapsed": false, "input": [ "## use hdf5 in a managed context (so painful to remember to close the file)\n", "with pd.get_store('data/store.h5') as store:\n", " print store.keys()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['/df']\n" ] } ], "prompt_number": 22 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### HDF5 Read/Write API\n", "- HDFStore supports a top-level API: read_hdf() for reading and to_hdf() for writing." 
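, "\n", "A minimal round trip with this API might look like the following (a sketch; the file path and key are arbitrary examples, and PyTables must be installed):\n", "\n", "```python\n", "df = pd.DataFrame({'A': [1, 2, 3]})\n", "df.to_hdf('data/api_demo.h5', 'demo')            # writer is an object method\n", "same = pd.read_hdf('data/api_demo.h5', 'demo')   # reader is a top-level function\n", "```"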
] }, { "cell_type": "code", "collapsed": false, "input": [ "df = pd.DataFrame(dict(A = range(5), B = range(5)))\n", "## The append param is significant here\n", "df.to_hdf('data/store.h5', 'atable', append = True) ## make it a table\n", "pd.read_hdf('data/store.h5', 'atable', where = ['index>2'])\n", "with pd.get_store('data/store.h5') as store:\n", " print store" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\n", "File path: data/store.h5\n", "/atable frame_table (typ->appendable,nrows->10,ncols->2,indexers->[index])\n", "/df frame (shape->[8,3]) \n" ] } ], "prompt_number": 24 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Two Storing Formats of HDF5\n", "- Storer Format: when data are stored to HDF5 using `put()`, pandas writes them to PyTables in a *fixed* array format, called the **storer** format. These types of stores are **not appendable** once written (though you can simply remove them and rewrite), **nor are they queryable**; they must be retrieved in their entirety. They offer very fast writing and slightly faster reading than `table` stores\n", "- Table Format: HDFStore supports another PyTables format on disk, the `table` format. Conceptually a table is shaped very much like a DataFrame, with rows and columns. A `table` may be appended to in the same or other sessions. In addition, delete and query operations are supported\n", "- The `storer` format is assumed by default. 
The `table` format is selected by using the `append()` method or the `append=True` parameter" ] }, { "cell_type": "code", "collapsed": false, "input": [ "## A storer format will raise a TypeError if you try to\n", "## retrieve using a \"where\" parameter\n", "pd.DataFrame(rand(10, 2)).to_hdf('data/simple_storer.h5', 'df')\n", "pd.read_hdf('data/simple_storer.h5', 'df', where = 'index > 5')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 25 }, { "cell_type": 
"raw", "metadata": {}, "source": [ "with pd.get_store('data/store.h5') as store:\n", " del store['df_table']" ] }, { "cell_type": "code", "collapsed": false, "input": [ "## On the other hand, the table format (via append() or the append=True parameter)\n", "## supports appending and querying\n", "with pd.get_store('data/store.h5') as store:\n", " df1 = df.iloc[:2]\n", " df2 = df.iloc[2:]\n", " store.append('df_table', df1)\n", " df2.to_hdf('data/store.h5', 'df_table', append=True)\n", " print store" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\n", "File path: data/store.h5\n", "/atable frame_table (typ->appendable,nrows->10,ncols->2,indexers->[index])\n", "/df frame (shape->[8,3]) \n", "/df_table frame_table (typ->appendable,nrows->5,ncols->2,indexers->[index]) \n" ] } ], "prompt_number": 29 }, { "cell_type": "code", "collapsed": false, "input": [ "## Now the table-format object in the store is queryable\n", "with pd.get_store('data/store.h5') as store:\n", " display(store.df_table)\n", " display(store.select('df_table', where=['index > 2']))" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
\n", "
" ], "metadata": {}, "output_type": "display_data", "text": [ " A B\n", "0 0 0\n", "1 1 1\n", "2 2 2\n", "3 3 3\n", "4 4 4" ] }, { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
3 3 3
4 4 4
\n", "
" ], "metadata": {}, "output_type": "display_data", "text": [ " A B\n", "3 3 3\n", "4 4 4" ] } ], "prompt_number": 35 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Hierarchical Keys in HDF\n", "- Keys in HDF store are STRINGS. These can be in a hierarchical pathname like format (e.g., `foo/bar/bah`), which will generate a herarchy of sub-stores (or `Groups` in PyTables parlance). **Keys can be specified with out the leading '/' and are ALWAYS absolute (e.g. 'foo' refers to '/foo'). Removal operations can remove everything in the sub-store and BELOW, so be careful.**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "## Hierarchical nature of HDF store\n", "with pd.get_store('data/store.h5') as store:\n", " store.put('foo/bar/bah', df)\n", " store.append('food/orange', df)\n", " store.append('food/apple', df)\n", " print '\\n', store\n", " print '\\n', store.keys()\n", " store.remove('food')\n", " print '\\n', store" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\n", "\n", "File path: data/store.h5\n", "/atable frame_table (typ->appendable,nrows->10,ncols->2,indexers->[index])\n", "/df frame (shape->[8,3]) \n", "/df_table frame_table (typ->appendable,nrows->5,ncols->2,indexers->[index]) \n", "/food/apple frame_table (typ->appendable,nrows->5,ncols->2,indexers->[index]) \n", "/food/orange frame_table (typ->appendable,nrows->5,ncols->2,indexers->[index]) \n", "/foo/bar/bah frame (shape->[5,2]) \n", "\n", "['/atable', '/df', '/df_table', '/food/apple', '/food/orange', '/foo/bar/bah']\n", "\n", "\n", "File path: data/store.h5\n", "/atable frame_table (typ->appendable,nrows->10,ncols->2,indexers->[index])\n", "/df frame (shape->[8,3]) \n", "/df_table frame_table (typ->appendable,nrows->5,ncols->2,indexers->[index]) \n", "/foo/bar/bah frame (shape->[5,2]) \n" ] } ], "prompt_number": 37 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Querying a table in HDF\n", "- 
`select()` and `delete()` operations take an optional criterion that can be specified to select/delete only a subset of the data. **This allows one to keep a very large on-disk table and retrieve only a portion of it.**\n", "- A **query** is specified using the `Term` class under the hood.\n", " - `index` and `columns` are supported indexers of a DataFrame\n", " - `major_axis`, `minor_axis`, and `items` are supported indexers of a Panel\n", "- A `Term` can be created from a `dict`, `list`, `tuple` or `string`. Allowed operations are <, <=, >, >=, =, !=. A = will be inferred as an implicit set operation when two or more values are provided. The following are all valid terms:\n", " - dict(field = 'index', op = '>', value = '20121114')\n", " - ('index', '>', '20121114')\n", " - 'index > 20121114'\n", " - ('index', '>', datetime(2012, 11, 14))\n", " - ('index', ['20121114', '20121115'])\n", " - ('major_axis', '=', Timestamp('2012/11/14'))\n", " - ('minor_axis', ['A', 'B'])\n", "- `Queries` are built up using `a list of Terms` (currently terms can only be combined with a logical *and*). E.g., ['major_axis > 20000102', ('minor_axis', '=', ['A', 'B'])] roughly translates to: major_axis must be greater than the date 20000102 and the minor_axis must be A or B.\n", "- The `columns` keyword can be supplied to select a list of columns to be returned; this is equivalent to passing a Term('columns', list_of_columns_to_filter).\n", "- ***Specify the `data_columns` parameter to create an index on DataFrame columns (table format only)***: There is some performance degradation from making lots of columns into data columns, so it is up to the user to designate these. In addition, you cannot change data columns (nor indexables) after the first append/put operation (of course, you can simply read in the data and create a new table!)\n", "- Starting in 0.11, you can pass iterator=True or chunksize=number_in_a_chunk to select and select_as_multiple to return an iterator on the results. The default is 50,000 rows returned in a chunk (new in version 0.12).\n", "You can also use the iterator with read_hdf, which will open, then automatically close the store when finished iterating.\n", "- ***Be very careful with the np.nan representation when it comes to the select() method***\n", "- Use `select_column()` to select a single column from a table in a faster way" ] }, { "cell_type": "code", "collapsed": false, "input": [ "with pd.get_store('data/store.h5') as store:\n", " display(store.df_table)\n", " print 'QUERY CAN ONLY BE APPLIED TO table BUT NOT TO storer'\n", " print '>> Selecting column B only'\n", " display(store.select('df_table', columns=['B']))\n", " print '>> Equivalently using where='\n", " display(store.select('df_table', \n", " where = [('columns', '=', 'B')]))" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
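The querying rules above can be sketched in one self-contained example. Note that current pandas expresses `Term`s as string expressions combined with `&` rather than lists of tuples, and `pd.get_store` has been replaced by `pd.HDFStore`; the file and column names here are illustrative:

```python
import os
import tempfile
import numpy as np
import pandas as pd

path = os.path.join(tempfile.mkdtemp(), 'query_demo.h5')
df = pd.DataFrame({'A': np.arange(5), 'B': list('xyzxy')},
                  index=pd.date_range('2012-11-12', periods=5))

with pd.HDFStore(path) as store:
    # data_columns makes A and B individually queryable on disk
    store.append('df', df, data_columns=['A', 'B'])
    # terms are combined with a logical AND
    sel = store.select('df', where="index > '2012-11-13' & B == 'x'")
    # select_column pulls a single column without loading the rest
    col = store.select_column('df', 'A')

print(sel)
print(col.tolist())
```

Only `A` and `B` can appear in the `where=` expression here, because only they were declared as `data_columns` at the first `append`.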
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
\n", "
" ], "metadata": {}, "output_type": "display_data", "text": [ " A B\n", "0 0 0\n", "1 1 1\n", "2 2 2\n", "3 3 3\n", "4 4 4" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "QUERY CAN ONLY BE APPLIED TO table BUT NOT TO storer\n", ">> Selecting columns B and C only\n" ] }, { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
B
0 0
1 1
2 2
3 3
4 4
\n", "
" ], "metadata": {}, "output_type": "display_data", "text": [ " B\n", "0 0\n", "1 1\n", "2 2\n", "3 3\n", "4 4" ] }, { "output_type": "stream", "stream": "stdout", "text": [ ">> Equivalently using where=\n" ] }, { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
B
0 0
1 1
2 2
3 3
4 4
\n", "
" ], "metadata": {}, "output_type": "display_data", "text": [ " B\n", "0 0\n", "1 1\n", "2 2\n", "3 3\n", "4 4" ] } ], "prompt_number": 48 }, { "cell_type": "code", "collapsed": false, "input": [ "with pd.get_store('data/store.h5') as store:\n", " store.create_table_index('atable', optlevel=9, kind='full')\n", " print store.atable.dtypes\n", " print store.select('atable', where = ['index > 1', 'index < 3'])\n", " ## NOT VALID - BECAUSE DATA_COLUMNS ARE NOT INDEX YET\n", " #print store.select('atable', where = pd.Term('B > 3'))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "A int64\n", "B int64\n", "dtype: object\n", " A B\n", "2 2 2\n", "2 2 2\n" ] } ], "prompt_number": 67 }, { "cell_type": "code", "collapsed": false, "input": [ "## INDEX DATA COLUMNS SO QUERY CAN BE MADE ON THEM LATER\n", "with pd.get_store('data/store.h5') as store:\n", " dff = store.atable.copy()\n", " dff.index = range(dff.shape[0])\n", " dff.ix[:3, 'A'] = 100\n", " dff['string'] = 'foo'\n", " dff.ix[1:3, 'string'] = np.nan\n", " dff.iloc[7:9]['string'] = 'bar'\n", " dff['string2'] = 'cool'\n", " print dff\n", " ## indexing data columns\n", " ## BUILD ONCE AND IT WILL ALWAYS THERE!!\n", " ## SINCE IT SEEMS THAT YOU CAN ONLY DO THE INDEXING OF COLUMNS\n", " ## ONCE WHEN YOU CREATE THE TABLE, PROBABLY YOU SHOULD INCLUDE\n", " ## ALL INTERESTING COLUMNS AT THIS MOMENT.\n", " store.append('dff', dff, data_columns=['A', 'B', 'string', 'string2'])" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " A B string string2\n", "0 100 0 foo cool\n", "1 100 1 NaN cool\n", "2 100 2 NaN cool\n", "3 100 3 NaN cool\n", "4 4 4 foo cool\n", "5 0 0 foo cool\n", "6 1 1 foo cool\n", "7 2 2 bar cool\n", "8 3 3 bar cool\n", "9 4 4 foo cool\n" ] } ], "prompt_number": 152 }, { "cell_type": "raw", "metadata": {}, "source": [ "## need to delete the table and recreate if \n", "## adding new data 
columns to indexing\n", "with pd.get_store('data/store.h5') as store:\n", " del store['dff']" ] }, { "cell_type": "code", "collapsed": false, "input": [ "## Selecting NaN values with string 'nan'\n", "with pd.get_store('data/store.h5') as store:\n", " print store.select('dff', where=['string == foo', 'B == 4'])\n", " print \n", " print store.select('dff', where = ('string != nan'))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " A B string string2\n", "4 4 4 foo cool\n", "9 4 4 foo cool\n", "4 4 4 foo cool\n", "9 4 4 foo cool\n", "4 4 4 foo cool\n", "9 4 4 foo cool\n", "4 4 4 foo cool\n", "9 4 4 foo cool\n", "4 4 4 foo cool\n", "9 4 4 foo cool\n", "\n", " A B string string2\n", "0 0 0 foo cool\n", "4 4 4 foo cool\n", "5 0 0 foo cool\n", "6 1 1 foo cool\n", "7 2 2 bar cool\n", "8 3 3 bar cool\n", "9 4 4 foo cool\n", "0 0 0 foo cool\n", "4 4 4 foo cool\n", "5 0 0 foo cool\n", "6 1 1 foo cool\n", "7 2 2 bar cool\n", "8 3 3 bar cool\n", "9 4 4 foo cool\n", "0 0 0 foo cool\n", "4 4 4 foo cool\n", "5 0 0 foo cool\n", "6 1 1 foo cool\n", "7 2 2 bar cool\n", "8 3 3 bar cool\n", "9 4 4 foo cool\n", "0 -9223372036854775808 0 foo cool\n", "4 4 4 foo cool\n", "5 0 0 foo cool\n", "6 1 1 foo cool\n", "7 2 2 bar cool\n", "8 3 3 bar cool\n", "9 4 4 foo cool\n", "0 100 0 foo cool\n", "4 4 4 foo cool\n", "5 0 0 foo cool\n", "6 1 1 foo cool\n", "7 2 2 bar cool\n", "8 3 3 bar cool\n", "9 4 4 foo cool\n" ] } ], "prompt_number": 153 }, { "cell_type": "code", "collapsed": false, "input": [ "## Iterating mode - the in-context version of iterating\n", "for df in pd.read_hdf('data/store.h5', 'dff', chunksize = 3):\n", " print df" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " A B string string2\n", "0 0 0 foo cool\n", "1 1 1 NaN cool\n", "2 2 2 NaN cool\n", " A B string string2\n", "3 3 3 NaN cool\n", "4 4 4 foo cool\n", "5 0 0 foo cool\n", " A B string 
string2\n", "6 1 1 foo cool\n", "7 2 2 bar cool\n", "8 3 3 bar cool\n", " A B string string2\n", "9 4 4 foo cool\n", "0 0 0 foo cool\n", "1 1 1 NaN cool\n", " A B string string2\n", "2 2 2 NaN cool\n", "3 3 3 NaN cool\n", "4 4 4 foo cool\n", " A B string string2\n", "5 0 0 foo cool\n", "6 1 1 foo cool\n", "7 2 2 bar cool" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", " A B string string2\n", "8 3 3 bar cool\n", "9 4 4 foo cool\n", "0 0 0 foo cool\n", " A B string string2\n", "1 1 1 NaN cool\n", "2 2 2 NaN cool\n", "3 3 3 NaN cool\n", " A B string string2\n", "4 4 4 foo cool\n", "5 0 0 foo cool\n", "6 1 1 foo cool\n", " A B string string2\n", "7 2 2 bar cool\n", "8 3 3 bar cool\n", "9 4 4 foo cool\n", " A B string string2\n", "0 -9223372036854775808 0 foo cool\n", "1 -9223372036854775808 1 NaN cool\n", "2 -9223372036854775808 2 NaN cool" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", " A B string string2\n", "3 -9223372036854775808 3 NaN cool\n", "4 4 4 foo cool\n", "5 0 0 foo cool\n", " A B string string2\n", "6 1 1 foo cool\n", "7 2 2 bar cool\n", "8 3 3 bar cool\n", " A B string string2\n", "9 4 4 foo cool\n", "0 100 0 foo cool\n", "1 100 1 NaN cool\n", " A B string string2\n", "2 100 2 NaN cool\n", "3 100 3 NaN cool\n", "4 4 4 foo cool\n", " A B string string2\n", "5 0 0 foo cool\n", "6 1 1 foo cool\n", "7 2 2 bar cool" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", " A B string string2\n", "8 3 3 bar cool\n", "9 4 4 foo cool\n" ] } ], "prompt_number": 154 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SQL Queries" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import sqlite3\n", "from pandas.io import sql\n", "df = pd.DataFrame(dict(id = [26, 42, 63], \n", " Date = pd.date_range(\"20121018\", periods=3, freq='D'),\n", " Col1 = list('XYZ'),\n", " Col2 = [25.7, -12.4, 5.73], \n", " Col3 = [True, False, True]))\n", "conn = sqlite3.connect(\":memory:\")\n", "display(df)\n", 
"df.to_sql('df', conn)\n", "pd.read_sql('select Col1, Col2, Date from df where Date > \"2012-10-19\"', conn)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Col1Col2Col3Dateid
0 X 25.70 True2012-10-18 00:00:00 26
1 Y-12.40 False2012-10-19 00:00:00 42
2 Z 5.73 True2012-10-20 00:00:00 63
\n", "
" ], "metadata": {}, "output_type": "display_data", "text": [ " Col1 Col2 Col3 Date id\n", "0 X 25.70 True 2012-10-18 00:00:00 26\n", "1 Y -12.40 False 2012-10-19 00:00:00 42\n", "2 Z 5.73 True 2012-10-20 00:00:00 63" ] }, { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Col1Col2Date
0 Y-12.40 2012-10-19 00:00:00
1 Z 5.73 2012-10-20 00:00:00
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 162, "text": [ " Col1 Col2 Date\n", "0 Y -12.40 2012-10-19 00:00:00\n", "1 Z 5.73 2012-10-20 00:00:00" ] } ], "prompt_number": 162 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STATA, Data Readers" ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }