{ "metadata": { "name": "", "signature": "sha256:bf05ffdbcedd414ccb3497c51175c8fda710bfac942d0c947d925bac429448c7" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "s = pd.read_csv('sample.csv')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's some rather messy data:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "s" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CategoryWholenumDecimalYear
0 A 2 0.6 20014
1 A <1 5 2013
2 B 3 4 2012
3 B four -2 2011
4 C -5 >0 210
5 c NaN NaN 2009
\n", "

6 rows \u00d7 4 columns

\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 2, "text": [ " Category Wholenum Decimal Year\n", "0 A 2 0.6 20014\n", "1 A <1 5 2013\n", "2 B 3 4 2012\n", "3 B four -2 2011\n", "4 C -5 >0 210\n", "5 c NaN NaN 2009\n", "\n", "[6 rows x 4 columns]" ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "To install the pandawash extension, run:\n", "\n", "```\n", "%install_ext https://github.com/takluyver/pandawash/raw/master/pandawash.py\n", "```" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%load_ext pandawash" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first column is supposed to contain a limited set of categories, but the names are inconsistent. `%groupcol` creates a cell like the one below, which you edit to make some values equivalent. Here, we've replaced lowercase `c` with capital `C`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%groupcol s.Category" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "# Modify the mapping below to group equivalent values\n", "s.Category = s.Category.map({\n", " 'A': 'A',\n", " 'B': 'B',\n", " 'C': 'C',\n", " 'c': 'C',\n", "})" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some of the values in our numeric columns aren't proper numbers. `%numbercol` finds which values can't be converted to the specified type (`int` or `float`), and creates code to replace them. You fill in the replacement values." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%numbercol s.Wholenum int" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "# Fill in the replacements below\n", "s.Wholenum = s.Wholenum\\\n", " .fillna(value=0)\\\n", " .replace('four', '4')\\\n", " .replace('<1', '1')\\\n", " .astype(int)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "%numbercol s.Decimal float" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }, { "cell_type": "code", "collapsed": false, "input": [ "# Fill in the replacements below\n", "s.Decimal = s.Decimal\\\n", " .replace('>0', '0.1')\\\n", " .astype(float)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some of the years are clearly unreasonable. Using `%boundscheckcol`, we can clean up anything outside a defined range." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%boundscheckcol s.Year 1900-2100" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [ "# Fill in the replacements below\n", "s.Year = s.Year\\\n", " .replace(20014, 2014)\\\n", " .replace(210, 2010)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's the cleaned up DataFrame. The numeric columns now have appropriate datatypes, so you can easily use them in calculations." ] }, { "cell_type": "code", "collapsed": false, "input": [ "s" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CategoryWholenumDecimalYear
0 A 2 0.6 2014
1 A 1 5.0 2013
2 B 3 4.0 2012
3 B 4-2.0 2011
4 C-5 0.1 2010
5 C 0 NaN 2009
\n", "

6 rows \u00d7 4 columns

\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ " Category Wholenum Decimal Year\n", "0 A 2 0.6 2014\n", "1 A 1 5.0 2013\n", "2 B 3 4.0 2012\n", "3 B 4 -2.0 2011\n", "4 C -5 0.1 2010\n", "5 C 0 NaN 2009\n", "\n", "[6 rows x 4 columns]" ] } ], "prompt_number": 12 } ], "metadata": {} } ] }