{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "


\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:50.844074Z", "start_time": "2019-12-24T17:42:50.026363Z" } }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:50.856824Z", "start_time": "2019-12-24T17:42:50.848076Z" } }, "outputs": [], "source": [ "import builtins\n", "\n", "def print(*args, **kwargs):\n", " builtins.print(*args, **kwargs, end='\\n\\n')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Handling Missing Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The way that missing data is represented in pandas objects is somewhat imperfect, but it is functional for a lot of users. For numeric data, pandas uses the floating-point value **NaN** (Not a Number) to represent missing data. We call this a **sentinel value** that can be easily detected" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:51.015075Z", "start_time": "2019-12-24T17:42:50.858430Z" } }, "outputs": [ { "data": { "text/plain": [ "0 aardvark\n", "1 artichoke\n", "2 NaN\n", "3 avocado\n", "dtype: object" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])\n", "string_data" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:51.161311Z", "start_time": "2019-12-24T17:42:51.016109Z" } }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 True\n", "3 False\n", "dtype: bool" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string_data.isnull()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:51.295057Z", "start_time": "2019-12-24T17:42:51.164353Z" }, "hide_input": false }, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string_data.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:51.437478Z", "start_time": "2019-12-24T17:42:51.297653Z" } }, "outputs": [ { "data": { "text/plain": [ "0 True\n", "1 False\n", "2 True\n", "3 False\n", "dtype: bool" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The built-in Python None value is also treated as NA in object arrays\n", "\n", "string_data[0] = None\n", "string_data.isnull()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:51.605133Z", "start_time": "2019-12-24T17:42:51.438477Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NA handling methods\n", "\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArgumentDescription
1dropnaFilter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
2fillnaFill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.
3isnullReturn boolean values indicating which values are missing/NA.
4notnullNegation of isnull.
\n", "
" ], "text/plain": [ " Argument \\\n", "1 dropna \n", "2 fillna \n", "3 isnull \n", "4 notnull \n", "\n", " Description \n", "1 Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate. \n", "2 Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'. \n", "3 Return boolean values indicating which values are missing/NA. \n", "4 Negation of isnull. " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.options.display.max_colwidth = 500\n", "df = pd.read_csv(r'examples/NA_handling_methods.csv', names=['Description'], sep='\\n')\n", "df.index = list(range(1, 5))\n", "\n", "def f1(x):\n", " x = x.split(' ')\n", " return x.pop(0)\n", "\n", "\n", "df['Argument'] = df['Description'].map(f1) # get the first word\n", "\n", "def f2(x):\n", " x = x.split(' ')\n", " return \" \".join(x[1:])\n", "\n", "df['Description'] = df['Description'].map(f2) # remove the first word\n", "\n", "df = df.reindex(columns=['Argument', 'Description'])\n", "print(\"NA handling methods\")\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filtering Out Missing Data\n", "\n", "There are a few ways to filter out missing data. While you always have the option to do it by hand using **pandas.isnull** and **boolean indexing**, the **dropna** can be helpful." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:51.746687Z", "start_time": "2019-12-24T17:42:51.607016Z" } }, "outputs": [], "source": [ "# On a Series, it returns the Series with only the non-null data and index values\n", "\n", "from numpy import nan as NA" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:51.894072Z", "start_time": "2019-12-24T17:42:51.750247Z" } }, "outputs": [], "source": [ "data = pd.Series([1, NA, 3.5, NA, 7])" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:52.036025Z", "start_time": "2019-12-24T17:42:51.896067Z" } }, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "2 3.5\n", "4 7.0\n", "dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.dropna()\n", "# data[data.notnull()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With DataFrame objects, things are a bit more complex. You may want to drop rows or columns that are all NA or only those containing any NAs. dropna by **default drops any row containing a missing value**" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:52.179002Z", "start_time": "2019-12-24T17:42:52.037987Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.06.53.0
11.0NaNNaN
2NaNNaNNaN
3NaN6.53.0
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.0 6.5 3.0\n", "1 1.0 NaN NaN\n", "2 NaN NaN NaN\n", "3 NaN 6.5 3.0" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],\n", " [NA, NA, NA], [NA, 6.5, 3.]])\n", "data" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:52.326562Z", "start_time": "2019-12-24T17:42:52.179002Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.06.53.0
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.0 6.5 3.0" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cleaned = data.dropna()\n", "cleaned" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:52.492974Z", "start_time": "2019-12-24T17:42:52.328730Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.06.53.0
11.0NaNNaN
3NaN6.53.0
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.0 6.5 3.0\n", "1 1.0 NaN NaN\n", "3 NaN 6.5 3.0" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Passing how='all' will only drop rows that are all NA\n", "\n", "data.dropna(how='all')\n", "# data.dropna(how='any')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:52.663286Z", "start_time": "2019-12-24T17:42:52.493950Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0124
01.06.53.0NaN
11.0NaNNaNNaN
2NaNNaNNaNNaN
3NaN6.53.0NaN
\n", "
" ], "text/plain": [ " 0 1 2 4\n", "0 1.0 6.5 3.0 NaN\n", "1 1.0 NaN NaN NaN\n", "2 NaN NaN NaN NaN\n", "3 NaN 6.5 3.0 NaN" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# To drop columns in the same way, pass axis=1\n", "\n", "data[4] = NA\n", "data" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:52.828878Z", "start_time": "2019-12-24T17:42:52.665142Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.06.53.0
11.0NaNNaN
2NaNNaNNaN
3NaN6.53.0
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.0 6.5 3.0\n", "1 1.0 NaN NaN\n", "2 NaN NaN NaN\n", "3 NaN 6.5 3.0" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.dropna(axis=1, how='all')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A related way to filter out DataFrame rows tends to concern time series data. **Suppose you want to keep only rows containing a certain number of observations**. You can indicate this with the **thresh** argument" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:52.964051Z", "start_time": "2019-12-24T17:42:52.834105Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.196168NaNNaN
1-1.007727NaNNaN
22.909089NaN-2.142449
3-0.866425NaN1.881426
40.300446-1.2668300.296984
5-0.9630501.649808-1.686630
6-0.087371-0.342160-0.608004
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.196168 NaN NaN\n", "1 -1.007727 NaN NaN\n", "2 2.909089 NaN -2.142449\n", "3 -0.866425 NaN 1.881426\n", "4 0.300446 -1.266830 0.296984\n", "5 -0.963050 1.649808 -1.686630\n", "6 -0.087371 -0.342160 -0.608004" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(np.random.randn(7, 3))\n", "df.iloc[:4, 1] = NA\n", "df.iloc[:2, 2] = NA\n", "df" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:53.139880Z", "start_time": "2019-12-24T17:42:52.967617Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0 1 2\n", "4 0.300446 -1.266830 0.296984\n", "5 -0.963050 1.649808 -1.686630\n", "6 -0.087371 -0.342160 -0.608004\n", "\n", " 0 1 2\n", "2 2.909089 NaN -2.142449\n", "3 -0.866425 NaN 1.881426\n", "4 0.300446 -1.266830 0.296984\n", "5 -0.963050 1.649808 -1.686630\n", "6 -0.087371 -0.342160 -0.608004\n", "\n" ] } ], "source": [ "print(df.dropna())\n", "print(df.dropna(thresh=2))" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:53.321362Z", "start_time": "2019-12-24T17:42:53.141874Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.196168NaNNaN
1-1.007727NaNNaN
22.909089NaN-2.142449
3-0.866425NaN1.881426
40.300446-1.2668300.296984
5-0.9630501.649808-1.686630
6-0.087371-0.342160-0.608004
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.196168 NaN NaN\n", "1 -1.007727 NaN NaN\n", "2 2.909089 NaN -2.142449\n", "3 -0.866425 NaN 1.881426\n", "4 0.300446 -1.266830 0.296984\n", "5 -0.963050 1.649808 -1.686630\n", "6 -0.087371 -0.342160 -0.608004" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:53.477978Z", "start_time": "2019-12-24T17:42:53.324143Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(3, 3)\n", "\n", "(5, 3)\n", "\n" ] } ], "source": [ "print(df.dropna(subset=[1, 2], how='any').shape)\n", "print(df.dropna(subset=[1, 2], how='all').shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filling In Missing Data" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:53.630300Z", "start_time": "2019-12-24T17:42:53.480393Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.1961680.0000000.000000
1-1.0077270.0000000.000000
22.9090890.000000-2.142449
3-0.8664250.0000001.881426
40.300446-1.2668300.296984
5-0.9630501.649808-1.686630
6-0.087371-0.342160-0.608004
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.196168 0.000000 0.000000\n", "1 -1.007727 0.000000 0.000000\n", "2 2.909089 0.000000 -2.142449\n", "3 -0.866425 0.000000 1.881426\n", "4 0.300446 -1.266830 0.296984\n", "5 -0.963050 1.649808 -1.686630\n", "6 -0.087371 -0.342160 -0.608004" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna(0)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:53.795600Z", "start_time": "2019-12-24T17:42:53.632354Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.1961680.5000000.000000
1-1.0077270.5000000.000000
22.9090890.500000-2.142449
3-0.8664250.5000001.881426
40.300446-1.2668300.296984
5-0.9630501.649808-1.686630
6-0.087371-0.342160-0.608004
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.196168 0.500000 0.000000\n", "1 -1.007727 0.500000 0.000000\n", "2 2.909089 0.500000 -2.142449\n", "3 -0.866425 0.500000 1.881426\n", "4 0.300446 -1.266830 0.296984\n", "5 -0.963050 1.649808 -1.686630\n", "6 -0.087371 -0.342160 -0.608004" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Calling fillna with a dict, you can use a different fill value for each column\n", "\n", "df.fillna({1: 0.5, 2: 0})" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:53.971110Z", "start_time": "2019-12-24T17:42:53.798627Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.1961680.0000000.000000
1-1.0077270.0000000.000000
22.9090890.000000-2.142449
3-0.8664250.0000001.881426
40.300446-1.2668300.296984
5-0.9630501.649808-1.686630
6-0.087371-0.342160-0.608004
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.196168 0.000000 0.000000\n", "1 -1.007727 0.000000 0.000000\n", "2 2.909089 0.000000 -2.142449\n", "3 -0.866425 0.000000 1.881426\n", "4 0.300446 -1.266830 0.296984\n", "5 -0.963050 1.649808 -1.686630\n", "6 -0.087371 -0.342160 -0.608004" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# fillna returns a new object, but you can modify the existing object in-place\n", "_ = df.fillna(0, inplace=True)\n", "df" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:54.119327Z", "start_time": "2019-12-24T17:42:53.972107Z" } }, "outputs": [], "source": [ "df = pd.DataFrame(np.random.randn(6, 3))" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:54.585826Z", "start_time": "2019-12-24T17:42:54.119327Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-1.2830631.4560590.639081
1-0.863694-0.012437-0.741003
20.533858NaN-0.741355
31.690786NaN-0.396711
4-0.296663NaNNaN
5-1.422233NaNNaN
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -1.283063 1.456059 0.639081\n", "1 -0.863694 -0.012437 -0.741003\n", "2 0.533858 NaN -0.741355\n", "3 1.690786 NaN -0.396711\n", "4 -0.296663 NaN NaN\n", "5 -1.422233 NaN NaN" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[2:, 1] = NA\n", "df.iloc[4:, 2] = NA\n", "df" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:54.843584Z", "start_time": "2019-12-24T17:42:54.586856Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-1.2830631.4560590.639081
1-0.863694-0.012437-0.741003
20.533858-0.012437-0.741355
31.690786-0.012437-0.396711
4-0.296663-0.012437-0.396711
5-1.422233-0.012437-0.396711
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -1.283063 1.456059 0.639081\n", "1 -0.863694 -0.012437 -0.741003\n", "2 0.533858 -0.012437 -0.741355\n", "3 1.690786 -0.012437 -0.396711\n", "4 -0.296663 -0.012437 -0.396711\n", "5 -1.422233 -0.012437 -0.396711" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna(method='ffill')" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:54.997314Z", "start_time": "2019-12-24T17:42:54.846361Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-1.2830631.4560590.639081
1-0.863694-0.012437-0.741003
20.533858-0.012437-0.741355
31.690786-0.012437-0.396711
4-0.296663NaN-0.396711
5-1.422233NaN-0.396711
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -1.283063 1.456059 0.639081\n", "1 -0.863694 -0.012437 -0.741003\n", "2 0.533858 -0.012437 -0.741355\n", "3 1.690786 -0.012437 -0.396711\n", "4 -0.296663 NaN -0.396711\n", "5 -1.422233 NaN -0.396711" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna(method='ffill', limit=2)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:55.162941Z", "start_time": "2019-12-24T17:42:54.998307Z" } }, "outputs": [], "source": [ "# we might pass mean or median of a series\n", "\n", "data = pd.Series([1., NA, 3.5, NA, 7])" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:55.325659Z", "start_time": "2019-12-24T17:42:55.165933Z" } }, "outputs": [ { "data": { "text/plain": [ "0 1.000000\n", "1 3.833333\n", "2 3.500000\n", "3 3.833333\n", "4 7.000000\n", "dtype: float64" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.fillna(data.mean())" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-12-23T17:37:04.279897Z", "start_time": "2019-12-23T17:37:04.272219Z" } }, "source": [ "**Arguments** - **Description** for fillna\n", "\n", "**value** - Scalar value or dict-like object to use to fill missing values\n", "\n", "**method** - Interpolation; by default 'ffill' if function called with no other arguments\n", "\n", "**axis** - Axis to fill on; default axis=0\n", "\n", "**inplace** - Modify the calling object without producing a copy\n", "\n", "**limit** - For forward and backward filling, maximum number of consecutive periods to fill" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Transformation\n", "\n", "### Removing Duplicates" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:55.494690Z", "start_time": "2019-12-24T17:42:55.328594Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
k1k2
0one1
1two1
2one2
3two3
4one3
5two4
6two4
\n", "
" ], "text/plain": [ " k1 k2\n", "0 one 1\n", "1 two 1\n", "2 one 2\n", "3 two 3\n", "4 one 3\n", "5 two 4\n", "6 two 4" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],\n", " 'k2': [1, 1, 2, 3, 3, 4, 4]})\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The DataFrame method **duplicated** returns a boolean Series indicating whether each row is a duplicate (has been observed in a previous row) or not." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:55.649465Z", "start_time": "2019-12-24T17:42:55.498032Z" } }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", "5 False\n", "6 True\n", "dtype: bool" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.duplicated()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:55.804370Z", "start_time": "2019-12-24T17:42:55.652247Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
k1k2
0one1
1two1
2one2
3two3
4one3
5two4
\n", "
" ], "text/plain": [ " k1 k2\n", "0 one 1\n", "1 two 1\n", "2 one 2\n", "3 two 3\n", "4 one 3\n", "5 two 4" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# drop_duplicates returns a DataFrame where the duplicated array is False\n", "\n", "data.drop_duplicates()" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:55.962564Z", "start_time": "2019-12-24T17:42:55.805368Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
k1k2v1
0one10
1two11
2one22
3two33
4one34
5two45
6two46
\n", "
" ], "text/plain": [ " k1 k2 v1\n", "0 one 1 0\n", "1 two 1 1\n", "2 one 2 2\n", "3 two 3 3\n", "4 one 3 4\n", "5 two 4 5\n", "6 two 4 6" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Suppose we had an additional column of values and wanted to \n", "# filter duplicates only based on the 'k1' column\n", "\n", "data['v1'] = range(7)\n", "data" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:56.114393Z", "start_time": "2019-12-24T17:42:55.964622Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
k1k2v1
0one10
1two11
\n", "
" ], "text/plain": [ " k1 k2 v1\n", "0 one 1 0\n", "1 two 1 1" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.drop_duplicates(['k1'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Transforming Data Using a Function or Mapping\n", "\n", "For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame." ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:56.277029Z", "start_time": "2019-12-24T17:42:56.116562Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
foodounces
0bacon4.0
1pulled pork3.0
2bacon12.0
3Pastrami6.0
4corned beef7.5
5Bacon8.0
6pastrami3.0
7honey ham5.0
8nova lox6.0
\n", "
" ], "text/plain": [ " food ounces\n", "0 bacon 4.0\n", "1 pulled pork 3.0\n", "2 bacon 12.0\n", "3 Pastrami 6.0\n", "4 corned beef 7.5\n", "5 Bacon 8.0\n", "6 pastrami 3.0\n", "7 honey ham 5.0\n", "8 nova lox 6.0" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',\n", " 'Pastrami', 'corned beef', 'Bacon',\n", " 'pastrami', 'honey ham', 'nova lox'],\n", " 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})\n", "data" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:56.417657Z", "start_time": "2019-12-24T17:42:56.278980Z" } }, "outputs": [], "source": [ "# Suppose you wanted to add a column indicating the \n", "# type of animal that each food came from\n", "\n", "meat_to_animal = {\n", " 'bacon': 'pig',\n", " 'pulled pork': 'pig',\n", " 'pastrami': 'cow',\n", " 'corned beef': 'cow',\n", " 'honey ham': 'pig',\n", " 'nova lox': 'salmon'\n", "}" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:56.595457Z", "start_time": "2019-12-24T17:42:56.419812Z" } }, "outputs": [ { "data": { "text/plain": [ "0 bacon\n", "1 pulled pork\n", "2 bacon\n", "3 pastrami\n", "4 corned beef\n", "5 bacon\n", "6 pastrami\n", "7 honey ham\n", "8 nova lox\n", "Name: food, dtype: object" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lowercased = data['food'].str.lower()\n", "lowercased" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:56.767759Z", "start_time": "2019-12-24T17:42:56.597643Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
foodouncesanimal
0bacon4.0pig
1pulled pork3.0pig
2bacon12.0pig
3Pastrami6.0cow
4corned beef7.5cow
5Bacon8.0pig
6pastrami3.0cow
7honey ham5.0pig
8nova lox6.0salmon
\n", "
" ], "text/plain": [ " food ounces animal\n", "0 bacon 4.0 pig\n", "1 pulled pork 3.0 pig\n", "2 bacon 12.0 pig\n", "3 Pastrami 6.0 cow\n", "4 corned beef 7.5 cow\n", "5 Bacon 8.0 pig\n", "6 pastrami 3.0 cow\n", "7 honey ham 5.0 pig\n", "8 nova lox 6.0 salmon" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['animal'] = lowercased.map(meat_to_animal)\n", "# or\n", "# data['food'].map(lambda x: meat_to_animal[x.lower()])\n", "data" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-12-23T17:56:05.001259Z", "start_time": "2019-12-23T17:56:04.992428Z" } }, "source": [ "### Replacing Values\n", "\n", "Filling in missing data with the fillna method is a special case of more general value replacement. As you’ve already seen, map can be used to modify a subset of values in an object but **replace** provides a simpler and more flexible way to do so" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:56.900004Z", "start_time": "2019-12-24T17:42:56.768790Z" } }, "outputs": [], "source": [ "data = pd.Series([1., -999., 2., -999., -1000., 3.])" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:57.091197Z", "start_time": "2019-12-24T17:42:56.900004Z" } }, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 -999.0\n", "2 2.0\n", "3 -999.0\n", "4 -1000.0\n", "5 3.0\n", "dtype: float64" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:57.234609Z", "start_time": "2019-12-24T17:42:57.092227Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 1.0\n", "1 NaN\n", "2 2.0\n", "3 NaN\n", "4 -1000.0\n", "5 3.0\n", "dtype: float64\n", "\n", "0 1.0\n", "1 NaN\n", "2 2.0\n", "3 NaN\n", "4 NaN\n", "5 3.0\n", "dtype: float64\n", "\n", "0 1.0\n", "1 NaN\n", "2 2.0\n", "3 NaN\n", "4 0.0\n", "5 3.0\n", "dtype: float64\n", "\n", "None\n", "\n" ] } ], "source": [ "print(data.replace(-999, np.nan))\n", "print(data.replace([-999, -1000], np.nan))\n", "print(data.replace([-999, -1000], [np.nan, 0]))\n", "print(data.replace({-999: np.nan, -1000: 0}, inplace=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Renaming Axis Indexes\n", "\n", "axis labels can be transformed by a function or mapping of some form to produce new, differently labeled objects. 
" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:57.347390Z", "start_time": "2019-12-24T17:42:57.234609Z" } }, "outputs": [], "source": [ "data = pd.DataFrame(np.arange(12).reshape((3, 4)),\n", " index=['Ohio', 'Colorado', 'New York'],\n", " columns=['one', 'two', 'three', 'four'])" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:57.486408Z", "start_time": "2019-12-24T17:42:57.349987Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['OHIO', 'COLO', 'NEW '], dtype='object')\n", "\n", "Index(['OH', 'CO', 'NE'], dtype='object')\n", "\n" ] } ], "source": [ "transform = lambda x: x[:4].upper()\n", "\n", "# Like a Series, the axis indexes have a map method\n", "\n", "print(data.index.map(transform))\n", "\n", "# -----------\n", "trans = lambda x: x[:2].upper()\n", "print(data.index.map(trans))" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:57.630459Z", "start_time": "2019-12-24T17:42:57.488063Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
OHIO0123
COLO4567
NEW891011
\n", "
" ], "text/plain": [ " one two three four\n", "OHIO 0 1 2 3\n", "COLO 4 5 6 7\n", "NEW 8 9 10 11" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.index = data.index.map(transform)\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to create a transformed version of a dataset without modifying the original, a useful method is **rename**" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:57.779949Z", "start_time": "2019-12-24T17:42:57.631489Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ONETWOTHREEFOUR
Ohio0123
Colo4567
New891011
\n", "
" ], "text/plain": [ " ONE TWO THREE FOUR\n", "Ohio 0 1 2 3\n", "Colo 4 5 6 7\n", "New 8 9 10 11" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.rename(index=str.title, columns=str.upper)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:57.934232Z", "start_time": "2019-12-24T17:42:57.782092Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwopeekaboofour
INDIANA0123
COLO4567
NEW891011
\n", "
" ], "text/plain": [ " one two peekaboo four\n", "INDIANA 0 1 2 3\n", "COLO 4 5 6 7\n", "NEW 8 9 10 11" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# rename can be used in conjunction with a dict-like object\n", "# providing new values for a subset of the axis labels\n", "\n", "data.rename(index={'OHIO': 'INDIANA'},\n", " columns={'three': 'peekaboo'}, inplace=True)\n", "data" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:58.052890Z", "start_time": "2019-12-24T17:42:57.934232Z" } }, "outputs": [], "source": [ "# ToD: Discretization and Binning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Detecting and Filtering Outliers\n", "\n", "Filtering or transforming outliers is largely a matter of applying array operations." ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:58.224509Z", "start_time": "2019-12-24T17:42:58.052890Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
count1000.0000001000.0000001000.0000001000.000000
mean0.0153830.0248960.038430-0.064249
std1.0083770.9964690.9898670.995204
min-3.141061-3.430585-2.886733-3.260575
25%-0.708238-0.597616-0.656763-0.766798
50%0.0513030.0514380.048483-0.071339
75%0.7307750.6919720.6665960.639310
max3.7669053.3498573.3616742.854713
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "count 1000.000000 1000.000000 1000.000000 1000.000000\n", "mean 0.015383 0.024896 0.038430 -0.064249\n", "std 1.008377 0.996469 0.989867 0.995204\n", "min -3.141061 -3.430585 -2.886733 -3.260575\n", "25% -0.708238 -0.597616 -0.656763 -0.766798\n", "50% 0.051303 0.051438 0.048483 -0.071339\n", "75% 0.730775 0.691972 0.666596 0.639310\n", "max 3.766905 3.349857 3.361674 2.854713" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.DataFrame(np.random.randn(1000, 4))\n", "data.describe()" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:58.337732Z", "start_time": "2019-12-24T17:42:58.225600Z" } }, "outputs": [], "source": [ "col = data[0]" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:58.470172Z", "start_time": "2019-12-24T17:42:58.345580Z" } }, "outputs": [ { "data": { "text/plain": [ "1 -3.141061\n", "56 3.766905\n", "247 3.008167\n", "787 -3.135201\n", "Name: 0, dtype: float64" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "col[np.abs(col) > 3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To select all rows having a value exceeding 3 or –3, you can use the any method on a boolean DataFrame" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:58.634225Z", "start_time": "2019-12-24T17:42:58.473682Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
1-3.1410610.7831270.469851-0.395636
563.7669050.0309091.6334941.140207
2473.008167-0.9928880.260335-3.063802
327-0.0745633.3498572.032817-0.609838
405-0.362679-0.1031593.0873572.517252
448-0.4374643.054526-0.2544061.047457
611-0.369185-3.430585-0.515431-1.668876
619-0.3112820.5355453.039966-1.199799
631-0.9237380.346249-0.485636-3.260575
765-0.856659-0.717638-1.155797-3.069395
787-3.135201-1.338375-0.4041711.373373
9170.437193-3.026429-0.4384330.024585
984-0.6297880.3080673.361674-1.683448
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "1 -3.141061 0.783127 0.469851 -0.395636\n", "56 3.766905 0.030909 1.633494 1.140207\n", "247 3.008167 -0.992888 0.260335 -3.063802\n", "327 -0.074563 3.349857 2.032817 -0.609838\n", "405 -0.362679 -0.103159 3.087357 2.517252\n", "448 -0.437464 3.054526 -0.254406 1.047457\n", "611 -0.369185 -3.430585 -0.515431 -1.668876\n", "619 -0.311282 0.535545 3.039966 -1.199799\n", "631 -0.923738 0.346249 -0.485636 -3.260575\n", "765 -0.856659 -0.717638 -1.155797 -3.069395\n", "787 -3.135201 -1.338375 -0.404171 1.373373\n", "917 0.437193 -3.026429 -0.438433 0.024585\n", "984 -0.629788 0.308067 3.361674 -1.683448" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[(np.abs(data) > 3).any(axis=1)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Values can be set based on these criteria. Here is code to cap values outside the interval –3 to 3" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:58.768926Z", "start_time": "2019-12-24T17:42:58.635223Z" } }, "outputs": [], "source": [ "data[np.abs(data) > 3] = np.sign(data) * 3" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:58.935675Z", "start_time": "2019-12-24T17:42:58.768926Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
count1000.0000001000.0000001000.0000001000.000000
mean0.0148850.0249490.037941-0.063855
std1.0049370.9937110.9883300.994002
min-3.000000-3.000000-2.886733-3.000000
25%-0.708238-0.597616-0.656763-0.766798
50%0.0513030.0514380.048483-0.071339
75%0.7307750.6919720.6665960.639310
max3.0000003.0000003.0000002.854713
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "count 1000.000000 1000.000000 1000.000000 1000.000000\n", "mean 0.014885 0.024949 0.037941 -0.063855\n", "std 1.004937 0.993711 0.988330 0.994002\n", "min -3.000000 -3.000000 -2.886733 -3.000000\n", "25% -0.708238 -0.597616 -0.656763 -0.766798\n", "50% 0.051303 0.051438 0.048483 -0.071339\n", "75% 0.730775 0.691972 0.666596 0.639310\n", "max 3.000000 3.000000 3.000000 2.854713" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.describe()" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:59.066787Z", "start_time": "2019-12-24T17:42:58.936677Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
0-1.0-1.0-1.0-1.0
1-1.01.01.0-1.0
21.0-1.01.0-1.0
31.0-1.01.0-1.0
41.0-1.0-1.01.0
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 -1.0 -1.0 -1.0 -1.0\n", "1 -1.0 1.0 1.0 -1.0\n", "2 1.0 -1.0 1.0 -1.0\n", "3 1.0 -1.0 1.0 -1.0\n", "4 1.0 -1.0 -1.0 1.0" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The statement np.sign(data) produces 1 and –1 values based on \n", "# whether the values in data are positive or negative\n", "\n", "np.sign(data).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Permutation and Random Sampling\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the numpy.random.permutation function. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:59.221723Z", "start_time": "2019-12-24T17:42:59.067821Z" } }, "outputs": [], "source": [ "df = pd.DataFrame(np.arange(20).reshape((5, 4)))" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:59.370484Z", "start_time": "2019-12-24T17:42:59.222721Z" } }, "outputs": [ { "data": { "text/plain": [ "array([3, 4, 1, 2, 0])" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sampler = np.random.permutation(5)\n", "sampler" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That array can then be used in iloc-based indexing or the equivalent **take** function" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:59.559900Z", "start_time": "2019-12-24T17:42:59.374481Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
00123
14567
2891011
312131415
416171819
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 0 1 2 3\n", "1 4 5 6 7\n", "2 8 9 10 11\n", "3 12 13 14 15\n", "4 16 17 18 19" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:59.714703Z", "start_time": "2019-12-24T17:42:59.562894Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
312131415
416171819
14567
2891011
00123
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "3 12 13 14 15\n", "4 16 17 18 19\n", "1 4 5 6 7\n", "2 8 9 10 11\n", "0 0 1 2 3" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.take(sampler)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To select a random subset **without replacement**, you can use the **sample** method on Series and DataFrame" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:59.877470Z", "start_time": "2019-12-24T17:42:59.717464Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
416171819
2891011
14567
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "4 16 17 18 19\n", "2 8 9 10 11\n", "1 4 5 6 7" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sample(n=3)" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:00.022827Z", "start_time": "2019-12-24T17:42:59.880463Z" } }, "outputs": [ { "data": { "text/plain": [ "3 6\n", "1 7\n", "3 6\n", "0 5\n", "3 6\n", "0 5\n", "1 7\n", "2 -1\n", "2 -1\n", "4 4\n", "dtype: int64" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# sample with replacement\n", "\n", "choices = pd.Series([5, 7, -1, 6, 4])\n", "draws = choices.sample(n=10, replace=True)\n", "draws" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Computing Indicator/Dummy Variables\n" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:01.237733Z", "start_time": "2019-12-24T17:43:00.025854Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Thayer)female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S " ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train = pd.read_csv('http://bit.ly/kaggletrain')\n", "train.head()" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:01.266505Z", "start_time": "2019-12-24T17:43:01.239975Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedSex_male
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS1
1211Cumings, Mrs. John Bradley (Florence Briggs Thayer)female38.010PC 1759971.2833C85C0
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS0
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S0
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS1
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked Sex_male \n", "0 0 A/5 21171 7.2500 NaN S 1 \n", "1 0 PC 17599 71.2833 C85 C 0 \n", "2 0 STON/O2. 3101282 7.9250 NaN S 0 \n", "3 0 113803 53.1000 C123 S 0 \n", "4 0 373450 8.0500 NaN S 1 " ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a dummy column for Sex column\n", "\n", "# using map method\n", "train['Sex_male'] = train.Sex.map({'female': 0, 'male': 1})\n", "train.head()" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:01.411456Z", "start_time": "2019-12-24T17:43:01.268780Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
femalemale
001
110
210
310
401
.........
88601
88710
88810
88901
89001
\n", "

891 rows × 2 columns

\n", "
" ], "text/plain": [ " female male\n", "0 0 1\n", "1 1 0\n", "2 1 0\n", "3 1 0\n", "4 0 1\n", ".. ... ...\n", "886 0 1\n", "887 1 0\n", "888 1 0\n", "889 0 1\n", "890 0 1\n", "\n", "[891 rows x 2 columns]" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# using get_dummies\n", " \n", "pd.get_dummies(train.Sex)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generally, **k** categorical values, then use **k-1** dummy variables to represent it.\n", "\n", "for Sex, we have to categorical values so we need only 1 variable to encode all the given information. So we drop the first column." ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:01.573467Z", "start_time": "2019-12-24T17:43:01.414109Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sex_male
01
10
20
30
41
......
8861
8870
8880
8891
8901
\n", "

891 rows × 1 columns

\n", "
" ], "text/plain": [ " Sex_male\n", "0 1\n", "1 0\n", "2 0\n", "3 0\n", "4 1\n", ".. ...\n", "886 1\n", "887 0\n", "888 0\n", "889 1\n", "890 1\n", "\n", "[891 rows x 1 columns]" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.get_dummies(train.Sex, prefix='Sex').iloc[:, 1:]" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:01.712385Z", "start_time": "2019-12-24T17:43:01.577414Z" } }, "outputs": [ { "data": { "text/plain": [ "S 644\n", "C 168\n", "Q 77\n", "Name: Embarked, dtype: int64" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.Embarked.value_counts()" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:01.845962Z", "start_time": "2019-12-24T17:43:01.712385Z" } }, "outputs": [], "source": [ "embarked_dummies = pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:, 1:]" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:02.023567Z", "start_time": "2019-12-24T17:43:01.845962Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedSex_maleEmbarked_QEmbarked_S
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS101
1211Cumings, Mrs. John Bradley (Florence Briggs Thayer)female38.010PC 1759971.2833C85C000
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS001
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S001
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS101
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked Sex_male Embarked_Q \\\n", "0 0 A/5 21171 7.2500 NaN S 1 0 \n", "1 0 PC 17599 71.2833 C85 C 0 0 \n", "2 0 STON/O2. 3101282 7.9250 NaN S 0 0 \n", "3 0 113803 53.1000 C123 S 0 0 \n", "4 0 373450 8.0500 NaN S 1 0 \n", "\n", " Embarked_S \n", "0 1 \n", "1 0 \n", "2 1 \n", "3 1 \n", "4 1 " ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train = pd.concat([train, embarked_dummies], axis=1)\n", "train.head()" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:02.754062Z", "start_time": "2019-12-24T17:43:02.024777Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameAgeSibSpParchTicketFareCabinSex_femaleSex_maleEmbarked_CEmbarked_QEmbarked_S
0103Braund, Mr. Owen Harris22.010A/5 211717.2500NaN01001
1211Cumings, Mrs. John Bradley (Florence Briggs Thayer)38.010PC 1759971.2833C8510100
2313Heikkinen, Miss. Laina26.000STON/O2. 31012827.9250NaN10001
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)35.01011380353.1000C12310001
4503Allen, Mr. William Henry35.0003734508.0500NaN01001
................................................
88688702Montvila, Rev. Juozas27.00021153613.0000NaN01001
88788811Graham, Miss. Margaret Edith19.00011205330.0000B4210001
88888903Johnston, Miss. Catherine Helen \"Carrie\"NaN12W./C. 660723.4500NaN10001
88989011Behr, Mr. Karl Howell26.00011136930.0000C14801100
89089103Dooley, Mr. Patrick32.0003703767.7500NaN01010
\n", "

891 rows × 15 columns

\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", ".. ... ... ... \n", "886 887 0 2 \n", "887 888 1 1 \n", "888 889 0 3 \n", "889 890 1 1 \n", "890 891 0 3 \n", "\n", " Name Age SibSp Parch \\\n", "0 Braund, Mr. Owen Harris 22.0 1 0 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 38.0 1 0 \n", "2 Heikkinen, Miss. Laina 26.0 0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1 0 \n", "4 Allen, Mr. William Henry 35.0 0 0 \n", ".. ... ... ... ... \n", "886 Montvila, Rev. Juozas 27.0 0 0 \n", "887 Graham, Miss. Margaret Edith 19.0 0 0 \n", "888 Johnston, Miss. Catherine Helen \"Carrie\" NaN 1 2 \n", "889 Behr, Mr. Karl Howell 26.0 0 0 \n", "890 Dooley, Mr. Patrick 32.0 0 0 \n", "\n", " Ticket Fare Cabin Sex_female Sex_male Embarked_C \\\n", "0 A/5 21171 7.2500 NaN 0 1 0 \n", "1 PC 17599 71.2833 C85 1 0 1 \n", "2 STON/O2. 3101282 7.9250 NaN 1 0 0 \n", "3 113803 53.1000 C123 1 0 0 \n", "4 373450 8.0500 NaN 0 1 0 \n", ".. ... ... ... ... ... ... \n", "886 211536 13.0000 NaN 0 1 0 \n", "887 112053 30.0000 B42 1 0 0 \n", "888 W./C. 6607 23.4500 NaN 1 0 0 \n", "889 111369 30.0000 C148 0 1 1 \n", "890 370376 7.7500 NaN 0 1 0 \n", "\n", " Embarked_Q Embarked_S \n", "0 0 1 \n", "1 0 0 \n", "2 0 1 \n", "3 0 1 \n", "4 0 1 \n", ".. ... ... \n", "886 0 1 \n", "887 0 1 \n", "888 0 1 \n", "889 0 0 \n", "890 1 0 \n", "\n", "[891 rows x 15 columns]" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# pass columns to get_dummies makes it more easy\n", "\n", "train = pd.read_csv('http://bit.ly/kaggletrain')\n", "\n", "pd.get_dummies(train, columns=['Sex', 'Embarked'])" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:02.783417Z", "start_time": "2019-12-24T17:43:02.756504Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameAgeSibSpParchTicketFareCabinSex_maleEmbarked_QEmbarked_S
0103Braund, Mr. Owen Harris22.010A/5 211717.2500NaN101
1211Cumings, Mrs. John Bradley (Florence Briggs Thayer)38.010PC 1759971.2833C85000
2313Heikkinen, Miss. Laina26.000STON/O2. 31012827.9250NaN001
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)35.01011380353.1000C123001
4503Allen, Mr. William Henry35.0003734508.0500NaN101
..........................................
88688702Montvila, Rev. Juozas27.00021153613.0000NaN101
88788811Graham, Miss. Margaret Edith19.00011205330.0000B42001
88888903Johnston, Miss. Catherine Helen \"Carrie\"NaN12W./C. 660723.4500NaN001
88989011Behr, Mr. Karl Howell26.00011136930.0000C148100
89089103Dooley, Mr. Patrick32.0003703767.7500NaN110
\n", "

891 rows × 13 columns

\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", ".. ... ... ... \n", "886 887 0 2 \n", "887 888 1 1 \n", "888 889 0 3 \n", "889 890 1 1 \n", "890 891 0 3 \n", "\n", " Name Age SibSp Parch \\\n", "0 Braund, Mr. Owen Harris 22.0 1 0 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 38.0 1 0 \n", "2 Heikkinen, Miss. Laina 26.0 0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1 0 \n", "4 Allen, Mr. William Henry 35.0 0 0 \n", ".. ... ... ... ... \n", "886 Montvila, Rev. Juozas 27.0 0 0 \n", "887 Graham, Miss. Margaret Edith 19.0 0 0 \n", "888 Johnston, Miss. Catherine Helen \"Carrie\" NaN 1 2 \n", "889 Behr, Mr. Karl Howell 26.0 0 0 \n", "890 Dooley, Mr. Patrick 32.0 0 0 \n", "\n", " Ticket Fare Cabin Sex_male Embarked_Q Embarked_S \n", "0 A/5 21171 7.2500 NaN 1 0 1 \n", "1 PC 17599 71.2833 C85 0 0 0 \n", "2 STON/O2. 3101282 7.9250 NaN 0 0 1 \n", "3 113803 53.1000 C123 0 0 1 \n", "4 373450 8.0500 NaN 1 0 1 \n", ".. ... ... ... ... ... ... \n", "886 211536 13.0000 NaN 1 0 1 \n", "887 112053 30.0000 B42 0 0 1 \n", "888 W./C. 6607 23.4500 NaN 0 0 1 \n", "889 111369 30.0000 C148 1 0 0 \n", "890 370376 7.7500 NaN 1 1 0 \n", "\n", "[891 rows x 13 columns]" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# drop the first column after get_dummies\n", "\n", "pd.get_dummies(train, columns=['Sex', 'Embarked'], drop_first=True)" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:02.991607Z", "start_time": "2019-12-24T17:43:02.784456Z" } }, "outputs": [], "source": [ "# example of MovieLens 1M dataset\n", "\n", "mnames = ['movieid', 'title', 'genres']" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:03.179736Z", "start_time": "2019-12-24T17:43:02.993603Z" } }, "outputs": [], "source": [ "movies = pd.read_table(r'MovieLens-1M/movies.dat', sep='::', header=None,\n", " names=mnames, engine='python')" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:03.445851Z", "start_time": "2019-12-24T17:43:03.182662Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
movieidtitlegenres
01Toy Story (1995)Animation|Children's|Comedy
12Jumanji (1995)Adventure|Children's|Fantasy
23Grumpier Old Men (1995)Comedy|Romance
34Waiting to Exhale (1995)Comedy|Drama
45Father of the Bride Part II (1995)Comedy
\n", "
" ], "text/plain": [ " movieid title genres\n", "0 1 Toy Story (1995) Animation|Children's|Comedy\n", "1 2 Jumanji (1995) Adventure|Children's|Fantasy\n", "2 3 Grumpier Old Men (1995) Comedy|Romance\n", "3 4 Waiting to Exhale (1995) Comedy|Drama\n", "4 5 Father of the Bride Part II (1995) Comedy" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movies.head()" ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:03.606208Z", "start_time": "2019-12-24T17:43:03.447874Z" } }, "outputs": [], "source": [ "all_genres = []\n", "\n", "for x in movies.genres:\n", " all_genres.extend(x.split('|'))" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:03.741670Z", "start_time": "2019-12-24T17:43:03.609697Z" } }, "outputs": [ { "data": { "text/plain": [ "array(['Animation', \"Children's\", 'Comedy', 'Adventure', 'Fantasy',\n", " 'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',\n", " 'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',\n", " 'Western'], dtype=object)" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "genres = pd.unique(all_genres)\n", "genres" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:03.877471Z", "start_time": "2019-12-24T17:43:03.745292Z" } }, "outputs": [], "source": [ "# One way to construct the indicator DataFrame is to start with a\n", "# DataFrame of allzeros\n", "\n", "zero_matrix = np.zeros((len(movies), len(genres)))" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:04.014237Z", "start_time": "2019-12-24T17:43:03.881359Z" } }, "outputs": [], "source": [ "dummies = pd.DataFrame(zero_matrix, columns=genres)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, iterate through each movie and set entries in each row of dummies to 1. 
To do this, we use dummies.columns to compute the column indices for each genre" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:04.141318Z", "start_time": "2019-12-24T17:43:04.016094Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Animation', \"Children's\", 'Comedy']\n", "\n" ] }, { "data": { "text/plain": [ "array([0, 1, 2], dtype=int64)" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gen = movies.genres[0]\n", "print(gen.split('|'))\n", "dummies.columns.get_indexer(gen.split('|'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, we can use .iloc to set values based on these indices" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:06.917275Z", "start_time": "2019-12-24T17:43:04.144485Z" } }, "outputs": [], "source": [ "for i, gen in enumerate(movies.genres):\n", " indices = dummies.columns.get_indexer(gen.split('|'))\n", " dummies.iloc[i, indices] = 1" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:06.932929Z", "start_time": "2019-12-24T17:43:06.917275Z" } }, "outputs": [], "source": [ "# join with movies\n", "\n", "movies_windic = movies.join(dummies.add_prefix('Genre_'))" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:07.061469Z", "start_time": "2019-12-24T17:43:06.935434Z" } }, "outputs": [ { "data": { "text/plain": [ "movieid 1\n", "title Toy Story (1995)\n", "genres Animation|Children's|Comedy\n", "Genre_Animation 1\n", "Genre_Children's 1\n", "Genre_Comedy 1\n", "Genre_Adventure 0\n", "Genre_Fantasy 0\n", "Genre_Romance 0\n", "Genre_Drama 0\n", "Genre_Action 0\n", "Genre_Crime 0\n", "Genre_Thriller 0\n", "Genre_Horror 0\n", "Genre_Sci-Fi 0\n", "Genre_Documentary 0\n", "Genre_War 0\n", "Genre_Musical 0\n", "Genre_Mystery 0\n", "Genre_Film-Noir 0\n", "Genre_Western 0\n", "Name: 0, dtype: object" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movies_windic.iloc[0]" ] },
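{ "cell_type": "markdown", "metadata": {}, "source": [ "The loop above spells out the mechanics. As a side note, pandas can build the same genre indicators in a single line with Series.str.get_dummies, which splits each string on the given separator and returns the 0/1 columns (sorted alphabetically) directly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# one-line alternative to the zero-matrix loop above\n", "movies.genres.str.get_dummies('|')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## String Manipulation\n", "### Basic String methods" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:07.209653Z", "start_time": "2019-12-24T17:43:07.063739Z" } }, "outputs": [ { "data": { "text/plain": [ "['a', 'b', 'guido']" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val = 'a,b, guido'\n", "\n", "pieces = [x.strip() for x in val.split(',')]\n", "pieces" ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:07.361358Z", "start_time": "2019-12-24T17:43:07.210654Z" } }, "outputs": [ { "data": { "text/plain": [ "'a::b::guido'" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"::\".join(pieces)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**built-in String methods**\n", "\n", "|Argument | Description |\n", "|---|---|\n", "|count|Return the number of non-overlapping occurrences of substring in the string. |\n", "|endswith|Returns True if string ends with suffix. |\n", "|startswith|Returns True if string starts with prefix. |\n", "|join|Use string as delimiter for concatenating a sequence of other strings. 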
|\n", "|index|Return position of first character in substring if found in the string; raises ValueError if not found. |\n", "|find|Return position of first character of €rst occurrence of substring in the string; like index, but returns –1 if not found.|\n", "|rfind|Return position of first character of last occurrence of substring in the string; returns –1 if not found. |\n", "|replace|Replace occurrences of string with another string.|\n", "|strip, rstrip, lstrip|Trim whitespace, including newlines; equivalent to x.strip() (and rstrip, lstrip, respectively)\n", " for each element|\n", "|split|Break string into list of substrings using passed delimiter. |\n", "|lower|Convert alphabet characters to lowercase. |\n", "|upper|Convert alphabet characters to uppercase. |\n", "|casefold|Convert characters to lowercase, and convert any region-specific variable character combinations to a common comparable form. \n", "|ljust, rjust|Left justify or right justify, respectively; pad opposite side of string with spaces (or some other fill character) to return a string with a minimum width |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Regular Expressions" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-12-24T13:32:20.008167Z", "start_time": "2019-12-24T13:32:19.995045Z" } }, "source": [ "The re module functions fall into three categories: pattern matching, substitution,\n", "and splitting. Naturally these are all related; a regex describes a pattern to locate in the\n", "text, which can then be used for many purposes." ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:07.495362Z", "start_time": "2019-12-24T17:43:07.367465Z" } }, "outputs": [], "source": [ "import re" ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:07.637104Z", "start_time": "2019-12-24T17:43:07.495362Z" } }, "outputs": [ { "data": { "text/plain": [ "'foo bar\\tbaz \\tqux'" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = \"foo bar\\tbaz \\tqux\"\n", "text" ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:07.780343Z", "start_time": "2019-12-24T17:43:07.638102Z" } }, "outputs": [ { "data": { "text/plain": [ "['foo', 'bar', 'baz', 'qux']" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.split('\\s+', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When you call re.split('\\s+', text), the regular expression is first compiled, and then its split method is called on the passed text. 
You can compile the regex yourself with **re.compile**, forming a reusable regex object" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating a regex object with re.compile is highly recommended if you intend to\n", "apply the same expression to many strings; doing so will save CPU cycles" ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:07.916446Z", "start_time": "2019-12-24T17:43:07.785420Z" } }, "outputs": [], "source": [ "regex = re.compile('\\s+')" ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:08.056820Z", "start_time": "2019-12-24T17:43:07.920407Z" } }, "outputs": [ { "data": { "text/plain": [ "['foo', 'bar', 'baz', 'qux']" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regex.split(text)" ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:08.205267Z", "start_time": "2019-12-24T17:43:08.058813Z" } }, "outputs": [], "source": [ "text = \"\"\"Dave dave@google.com\n", "Steve steve@gmail.com\n", "Rob rob@gmail.com\n", "Ryan ryan@yahoo.com\n", "\"\"\"\n", "pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}'\n", "\n", "# re.IGNORECASE makes the regex case-insensitive\n", "regex = re.compile(pattern, flags=re.IGNORECASE)" ] }, { "cell_type": "code", "execution_count": 88, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:08.365332Z", "start_time": "2019-12-24T17:43:08.212747Z" } }, "outputs": [ { "data": { "text/plain": [ "['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regex.findall(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**match** and **search** are closely related to findall. While findall returns all matches\n", "in a string, search returns only the first match. More rigidly, match only matches at\n", "the beginning of the string" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**search** returns a special match object for the first email address in the text. 
For the\n", "preceding regex, the match object can only tell us the start and end position of the\n", "pattern in the string" ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:08.529015Z", "start_time": "2019-12-24T17:43:08.366332Z" } }, "outputs": [], "source": [ "m = regex.search(text)" ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:08.991732Z", "start_time": "2019-12-24T17:43:08.529015Z" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m" ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:09.132275Z", "start_time": "2019-12-24T17:43:08.994110Z" } }, "outputs": [ { "data": { "text/plain": [ "'dave@google.com'" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text[m.start(): m.end()]" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:09.248262Z", "start_time": "2019-12-24T17:43:09.133848Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "None\n", "\n" ] } ], "source": [ "print(regex.match(text))" ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:09.404466Z", "start_time": "2019-12-24T17:43:09.253311Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dave REDACTED\n", "Steve REDACTED\n", "Rob REDACTED\n", "Ryan REDACTED\n", "\n", "\n" ] } ], "source": [ "print(regex.sub('REDACTED', text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " segment each address into its three components: username, domain name, and domain suffix" ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:09.547000Z", "start_time": "2019-12-24T17:43:09.407943Z" } }, "outputs": [], "source": [ "pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'" ] }, { "cell_type": "code", "execution_count": 95, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:09.688577Z", "start_time": "2019-12-24T17:43:09.548001Z" } }, "outputs": [], "source": [ "regex = re.compile(pattern, flags=re.IGNORECASE)" ] }, { "cell_type": "code", "execution_count": 96, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:09.805024Z", "start_time": "2019-12-24T17:43:09.692210Z" } }, "outputs": [ { "data": { "text/plain": [ "('wesm', 'bright', 'net')" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m = regex.match('wesm@bright.net')\n", "m.groups()" ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:09.942973Z", "start_time": "2019-12-24T17:43:09.811263Z" } }, "outputs": [ { "data": { "text/plain": [ "[('dave', 'google', 'com'),\n", " ('steve', 'gmail', 'com'),\n", " ('rob', 'gmail', 'com'),\n", " ('ryan', 'yahoo', 'com')]" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regex.findall(text)" ] }, { "cell_type": "code", "execution_count": 98, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:10.101665Z", "start_time": "2019-12-24T17:43:09.947203Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dave Username: dave, Domain: google, Suffix: com\n", "Steve Username: steve, Domain: 
gmail, Suffix: com\n", "Rob Username: rob, Domain: gmail, Suffix: com\n", "Ryan Username: ryan, Domain: yahoo, Suffix: com\n", "\n", "\n" ] } ], "source": [ "print(regex.sub(r'Username: \\1, Domain: \\2, Suffix: \\3', text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Regular expression methods**\n", "\n", "|Argument| Description|\n", "|-|-|\n", "|findall| Return all non-overlapping matching patterns in a string as a list|\n", "|finditer| Like findall, but returns an iterator|\n", "|match |Match pattern at start of string and optionally segment pattern components into groups; if the pattern matches, returns a match object, and otherwise None|\n", "|search| Scan string for match to pattern; returning a match object if so; unlike match, the match can be anywhere in the string as opposed to only at the beginning|\n", "|split| Break string into pieces at each occurrence of pattern|\n", "|sub, subn|Replace all (sub) or first n occurrences (subn) of pattern in string with replacement expression; use symbols \\1, \\2, ... to refer to match group elements in the replacement string|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Vectorized String Functions in pandas" ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:10.245311Z", "start_time": "2019-12-24T17:43:10.105666Z" } }, "outputs": [ { "data": { "text/plain": [ "Dave dave@google.com\n", "Steve steve@gmail.com\n", "Rob rob@gmail.com\n", "Wes NaN\n", "dtype: object" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',\n", " 'Rob': 'rob@gmail.com', 'Wes': np.nan}\n", "\n", "data = pd.Series(data)\n", "data" ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:10.380107Z", "start_time": "2019-12-24T17:43:10.246309Z" } }, "outputs": [ { "data": { "text/plain": [ "Dave False\n", "Steve False\n", "Rob False\n", "Wes True\n", "dtype: bool" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.isnull()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "String and regular expression methods can be applied (passing a lambda or other function) to each value using data.map, but it will fail on the NA (null) values. To cope with this, Series has array-oriented methods for string operations that skip NA values. 
These are accessed through Series’s str attribute; for example," ] }, { "cell_type": "code", "execution_count": 101, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:10.532755Z", "start_time": "2019-12-24T17:43:10.381108Z" } }, "outputs": [ { "data": { "text/plain": [ "Dave False\n", "Steve True\n", "Rob True\n", "Wes NaN\n", "dtype: object" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.str.contains('gmail')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Regular expressions can be used, too, along with any re options like IGNORECASE" ] }, { "cell_type": "code", "execution_count": 102, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:10.680071Z", "start_time": "2019-12-24T17:43:10.536520Z" } }, "outputs": [ { "data": { "text/plain": [ "'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\\\.([A-Z]{2,4})'" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pattern" ] }, { "cell_type": "code", "execution_count": 103, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:10.819319Z", "start_time": "2019-12-24T17:43:10.684214Z" } }, "outputs": [ { "data": { "text/plain": [ "Dave [(dave, google, com)]\n", "Steve [(steve, gmail, com)]\n", "Rob [(rob, gmail, com)]\n", "Wes NaN\n", "dtype: object" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.str.findall(pattern, flags=re.IGNORECASE)" ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:10.964110Z", "start_time": "2019-12-24T17:43:10.824320Z" } }, "outputs": [], "source": [ "matches = data.str.match(pattern, flags=re.IGNORECASE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a couple of ways to do vectorized element retrieval. Either use **str.get** or index into the str attribute, as sketched in the cells below" ] }, { "cell_type": "code", "execution_count": 105, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:11.118070Z", "start_time": "2019-12-24T17:43:10.965112Z" } }, "outputs": [ { "data": { "text/plain": [ "Dave True\n", "Steve True\n", "Rob True\n", "Wes NaN\n", "dtype: object" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matches" ] },
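{ "cell_type": "markdown", "metadata": {}, "source": [ "A minimal sketch of both retrieval styles on the data Series from above (NA values are passed through as NaN):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# str.get retrieves an element by position; slicing works via the str attribute\n", "print(data.str.get(0))  # first character of each string, NaN preserved\n", "print(data.str[:5])     # first five characters of each string" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }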