{ "metadata": { "name": "", "signature": "sha256:b2597ea4263c11dd6774b227e7a3a5626197c4863e6895002657fd55d02b55d9" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to python_reference](https://github.com/rasbt/python_reference)]" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%load_ext watermark" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "%watermark -v -p numpy -d -u" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Last updated: 31/07/2014 \n", "\n", "CPython 3.4.1\n", "IPython 2.1.0\n", "\n", "numpy 1.8.1\n" ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "[More information](https://github.com/rasbt/watermark) about the `watermark` magic command extension." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Quick guide for dealing with missing numbers in NumPy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is just a quick overview of how to deal with missing values (i.e., \"NaN\"s for \"Not-a-Number\") in NumPy and I am happy to expand it over time. Yes, and there will also be a separate one for pandas some time!\n", "\n", "I would be happy to hear your comments and suggestions. \n", "Please feel free to drop me a note via\n", "[twitter](https://twitter.com/rasbt), [email](mailto:bluewoodtree@gmail.com), or [google+](https://plus.google.com/+SebastianRaschka).\n", "
" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Sections" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- [Sample data from a CSV file](#Sample-data-from-a-CSV-file)\n", "- [Determining if a value is missing](#Determining-if-a-value-is-missing)\n", "- [Counting the number of missing values](#Counting-the-number-of-missing-values)\n", "- [Calculating the sum of an array that contains NaNs](#Calculating-the-sum-of-an-array-that-contains-NaNs)\n", "- [Removing all rows that contain missing values](#Removing-all-rows-that-contain-missing-values)\n", "- [Convert missing values to 0](#Convert-missing-values-to-0)\n", "- [Converting certain numbers to NaN](#Converting-certain-numbers-to-NaN)\n", "- [Remove all missing elements from an array](#Remove-all-missing-elements-from-an-array)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Sample data from a CSV file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's assume that we have a CSV file with missing elements like the one shown below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%file example.csv\n", "1,2,3,4\n", "5,6,,8\n", "10,11,12," ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Writing example.csv\n" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, the `np.genfromtxt` function translates empty fields into `np.nan` when it reads floating-point data; its `missing_values` parameter lets us declare additional strings that should be treated as missing. This allows us to construct a new NumPy `ndarray` object, even if elements are missing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "ary = np.genfromtxt('./example.csv', delimiter=',')\n", "\n", "print('%s x %s array:\\n' %(ary.shape[0], ary.shape[1]))\n", "print(ary)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "3 x 4 array:\n", "\n", "[[ 1. 2. 3. 4.]\n", " [ 5. 6. nan 8.]\n", " [ 10. 11. 12. nan]]\n" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
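As a small sketch of how this default can be changed: `np.genfromtxt` also accepts a `filling_values` argument that sets the value written in place of missing fields. The in-memory `csv_text` string and the choice of `-1` as the filler are assumptions made only for illustration; the string simply mirrors the contents of example.csv.

```python
import io
import numpy as np

# Same contents as example.csv, kept in memory so the sketch is self-contained
csv_text = '''1,2,3,4
5,6,,8
10,11,12,'''

# Default behaviour: empty fields become nan
ary = np.genfromtxt(io.StringIO(csv_text), delimiter=',')
print(ary)

# filling_values substitutes a value of our choice for the empty fields
# (-1 is an arbitrary, illustrative filler)
filled = np.genfromtxt(io.StringIO(csv_text), delimiter=',', filling_values=-1)
print(filled)
```

Whether a filler such as `-1` is safe depends on the data: it must not collide with a value that can legitimately occur.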
" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Determining if a value is missing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A handy function to test whether a value is a `NaN` or not is to use the `np.isnan` function." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "np.isnan(np.nan)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 5, "text": [ "True" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is especially useful to create boolean masks for the so-called \"fancy indexing\" of NumPy arrays, which we will come back to later." ] }, { "cell_type": "code", "collapsed": false, "input": [ "np.isnan(ary)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 6, "text": [ "array([[False, False, False, False],\n", " [False, False, True, False],\n", " [False, False, False, True]], dtype=bool)" ] } ], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
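One reason `np.isnan` is needed at all: under the IEEE 754 floating-point rules that NumPy follows, `NaN` compares unequal to everything, including itself, so an `==` comparison cannot find missing values. A minimal sketch, with the example array re-created by hand so that the snippet runs on its own:

```python
import numpy as np

ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

# Comparing against nan with == never matches, not even for nan itself
eq_mask = (ary == np.nan)
print(eq_mask.any())

# np.isnan is the reliable element-wise test
nan_mask = np.isnan(ary)
print(nan_mask.sum())
```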
" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Counting the number of missing values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to find out how many elements are missing in our array, we can use the `np.isnan` function that we have seen in the previous section. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "np.count_nonzero(np.isnan(ary))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 7, "text": [ "2" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to determine the number of non-missing elements, we can simply invert the returned Boolean mask via the handy \"tilde\" operator (`~`)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "np.count_nonzero(~np.isnan(ary))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 8, "text": [ "10" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
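The same mask can also be summed along an axis to see where the gaps are, since `True` counts as 1. A short sketch (the array is re-created by hand here so that the snippet is self-contained; the variable names are only illustrative):

```python
import numpy as np

ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

mask = np.isnan(ary)

# axis=0 collapses the rows, giving one count per column
missing_per_column = mask.sum(axis=0)

# axis=1 collapses the columns, giving one count per row
missing_per_row = mask.sum(axis=1)

print('missing per column:', missing_per_column)
print('missing per row:', missing_per_row)
```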
" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Calculating the sum of an array that contains `NaN`s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we will find out via the following code snippet, we can't use NumPy's regular `sum` function to calculate the sum of an array." ] }, { "cell_type": "code", "collapsed": false, "input": [ "np.sum(ary)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 9, "text": [ "nan" ] } ], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the `np.sum` function does not work, use `np.nansum` instead:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print('total sum:', np.nansum(ary))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "total sum: 62.0\n" ] } ], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [ "print('column sums:', np.nansum(ary, axis=0))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "column sums: [ 16. 19. 15. 12.]\n" ] } ], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [ "print('row sums:', np.nansum(ary, axis=1))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "row sums: [ 10. 19. 33.]\n" ] } ], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
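`np.nansum` has siblings that follow the same idea of skipping `NaN`s. The sketch below uses `np.nanmean`, `np.nanmin`, and `np.nanmax`; note that `np.nanmean` requires NumPy 1.8 or later. The array is re-created by hand so that the snippet runs on its own.

```python
import numpy as np

ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

# Mean of every column, computed only over the values that are present
col_means = np.nanmean(ary, axis=0)

# Smallest and largest non-missing value in the whole array
smallest = np.nanmin(ary)
largest = np.nanmax(ary)

print('column means:', col_means)
print('min:', smallest, 'max:', largest)
```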
" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Removing all rows that contain missing values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we will use the `Boolean mask` again to return only those rows that DON'T contain missing values. And if we want to get only the rows that contain `NaN`s, we could simply drop the `~`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "ary[~np.isnan(ary).any(1)]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 14, "text": [ "array([[ 1., 2., 3., 4.]])" ] } ], "prompt_number": 14 }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
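The same pattern works for columns if we reduce along axis 0 and index the second dimension instead. A minimal sketch, again with the array re-created by hand and with illustrative variable names:

```python
import numpy as np

ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

# Columns without any nan: reduce over the rows (axis=0),
# then use the inverted mask on the column axis
complete_cols = ary[:, ~np.isnan(ary).any(axis=0)]
print(complete_cols)

# Rows that do contain at least one nan (the row mask without the tilde)
incomplete_rows = ary[np.isnan(ary).any(axis=1)]
print(incomplete_rows)
```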
" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Convert missing values to 0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Certain operations, algorithms, and other analyses might not work with `NaN` objects in our data array. But that's not a problem: We can use the convenient `np.nan_to_num` function will convert it to the value 0." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "ary0 = np.nan_to_num(ary)\n", "ary0" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 15, "text": [ "array([[ 1., 2., 3., 4.],\n", " [ 5., 6., 0., 8.],\n", " [ 10., 11., 12., 0.]])" ] } ], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
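Zero is not always a sensible replacement. As one possible alternative (an illustration, not part of the recipe above), the sketch below fills each gap with the mean of its column via `np.where` and `np.nanmean`. It also shows that `np.nan_to_num` replaces infinities with very large finite numbers, which is easy to overlook.

```python
import numpy as np

ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

# Column means over the non-missing values; shape (4,) broadcasts over the rows
col_means = np.nanmean(ary, axis=0)

# Take the column mean where a value is missing, the original value otherwise
ary_imputed = np.where(np.isnan(ary), col_means, ary)
print(ary_imputed)

# nan_to_num also touches infinities: they become very large finite values
special = np.array([np.nan, np.inf, -np.inf])
converted = np.nan_to_num(special)
print(converted)
```

Which replacement is appropriate depends on the analysis; a mean changes the spread of a column, just as a zero changes its mean.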
" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Converting certain numbers to NaN" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Vice versa, we can also convert any number to a `np.NaN` object. Here, we use the array that we created in the previous section and convert the `0`s back to `np.nan` objects." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "ary0[ary0==0] = np.nan\n", "ary0" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 16, "text": [ "array([[ 1., 2., 3., 4.],\n", " [ 5., 6., nan, 8.],\n", " [ 10., 11., 12., nan]])" ] } ], "prompt_number": 16 }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
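Two caveats about this kind of conversion: the comparison also hits values that were genuine measurements rather than placeholders, and `NaN` can only be stored in floating-point arrays. A hedged sketch with a made-up sentinel value of `-999` in a hypothetical integer array `raw`:

```python
import numpy as np

# Hypothetical integer data in which -999 marks a missing entry
raw = np.array([[1, -999, 3],
                [4, 5, -999]])

# Integer arrays cannot hold nan, so we work on a float copy
data = raw.astype(float)

# Boolean mask plus assignment, as in the example above
data[data == -999] = np.nan
print(data)
```

The original `raw` array is left untouched because `astype` returns a copy.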
" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Remove all missing elements from an array" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is one is a little bit more tricky. We can remove missing values via a combination of the `Boolean` mask and fancy indexing, however, this will have the disadvantage that it will flatten our array (we can't just punch holes into a NumPy array)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "ary[~np.isnan(ary)]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 17, "text": [ "array([ 1., 2., 3., 4., 5., 6., 8., 10., 11., 12.])" ] } ], "prompt_number": 17 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Thus, this is a method that would better work on individual rows:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "x = np.array([1,2,np.nan])\n", "\n", "x[~np.isnan(np.array(x))]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 21, "text": [ "array([ 1., 2.])" ] } ], "prompt_number": 21 } ], "metadata": {} } ] }