{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "<h1>Table of Contents<span class=\"tocSkip\"></span></h1>\n", "<div class=\"toc\"><ul class=\"toc-item\"><li><span><a href=\"#Handling-Missing-Data\" data-toc-modified-id=\"Handling-Missing-Data-1\"><span class=\"toc-item-num\">1 </span>Handling Missing Data</a></span><ul class=\"toc-item\"><li><span><a href=\"#Filtering-Out-Missing-Data\" data-toc-modified-id=\"Filtering-Out-Missing-Data-1.1\"><span class=\"toc-item-num\">1.1 </span>Filtering Out Missing Data</a></span></li><li><span><a href=\"#Filling-In-Missing-Data\" data-toc-modified-id=\"Filling-In-Missing-Data-1.2\"><span class=\"toc-item-num\">1.2 </span>Filling In Missing Data</a></span></li></ul></li><li><span><a href=\"#Data-Transformation\" data-toc-modified-id=\"Data-Transformation-2\"><span class=\"toc-item-num\">2 </span>Data Transformation</a></span><ul class=\"toc-item\"><li><span><a href=\"#Removing-Duplicates\" data-toc-modified-id=\"Removing-Duplicates-2.1\"><span class=\"toc-item-num\">2.1 </span>Removing Duplicates</a></span></li><li><span><a href=\"#Transforming-Data-Using-a-Function-or-Mapping\" data-toc-modified-id=\"Transforming-Data-Using-a-Function-or-Mapping-2.2\"><span class=\"toc-item-num\">2.2 </span>Transforming Data Using a Function or Mapping</a></span></li><li><span><a href=\"#Replacing-Values\" data-toc-modified-id=\"Replacing-Values-2.3\"><span class=\"toc-item-num\">2.3 </span>Replacing Values</a></span></li><li><span><a href=\"#Renaming-Axis-Indexes\" data-toc-modified-id=\"Renaming-Axis-Indexes-2.4\"><span class=\"toc-item-num\">2.4 </span>Renaming Axis Indexes</a></span></li><li><span><a href=\"#Detecting-and-Filtering-Outliers\" data-toc-modified-id=\"Detecting-and-Filtering-Outliers-2.5\"><span class=\"toc-item-num\">2.5 </span>Detecting and Filtering Outliers</a></span></li><li><span><a href=\"#Permutation-and-Random-Sampling\" data-toc-modified-id=\"Permutation-and-Random-Sampling-2.6\"><span class=\"toc-item-num\">2.6 </span>Permutation and Random Sampling</a></span></li><li><span><a href=\"#Computing-Indicator/Dummy-Variables\" data-toc-modified-id=\"Computing-Indicator/Dummy-Variables-2.7\"><span class=\"toc-item-num\">2.7 </span>Computing Indicator/Dummy Variables</a></span></li></ul></li><li><span><a href=\"#String-Manipulation\" data-toc-modified-id=\"String-Manipulation-3\"><span class=\"toc-item-num\">3 </span>String Manipulation</a></span><ul class=\"toc-item\"><li><span><a href=\"#Basic-String-methods\" data-toc-modified-id=\"Basic-String-methods-3.1\"><span class=\"toc-item-num\">3.1 </span>Basic String methods</a></span></li><li><span><a href=\"#Regular-Expressions\" data-toc-modified-id=\"Regular-Expressions-3.2\"><span class=\"toc-item-num\">3.2 </span>Regular Expressions</a></span></li><li><span><a href=\"#Vectorized-String-Functions-in-pandas\" data-toc-modified-id=\"Vectorized-String-Functions-in-pandas-3.3\"><span class=\"toc-item-num\">3.3 </span>Vectorized String Functions in pandas</a></span></li></ul></li></ul></div>" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:50.844074Z", "start_time": "2019-12-24T17:42:50.026363Z" } }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:50.856824Z", "start_time": "2019-12-24T17:42:50.848076Z" } }, "outputs": [], "source": [ "import builtins\n", "\n", "def print(*args, **kwargs):\n", " builtins.print(*args, **kwargs, end='\\n\\n')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Handling Missing Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The way that missing data is represented in pandas objects is somewhat imperfect, but it is functional for a lot of users. For numeric data, pandas uses the floating-point value **NaN** (Not a Number) to represent missing data. We call this a **sentinel value** that can be easily detected" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:51.015075Z", "start_time": "2019-12-24T17:42:50.858430Z" } }, "outputs": [ { "data": { "text/plain": [ "0 aardvark\n", "1 artichoke\n", "2 NaN\n", "3 avocado\n", "dtype: object" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])\n", "string_data" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:51.161311Z", "start_time": "2019-12-24T17:42:51.016109Z" } }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 True\n", "3 False\n", "dtype: bool" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string_data.isnull()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:51.295057Z", "start_time": "2019-12-24T17:42:51.164353Z" }, "hide_input": false }, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string_data.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:51.437478Z", "start_time": "2019-12-24T17:42:51.297653Z" } }, "outputs": [ { "data": { "text/plain": [ "0 True\n", "1 False\n", "2 True\n", "3 False\n", "dtype: bool" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The built-in Python None value is also treated as NA in object arrays\n", "\n", "string_data[0] = None\n", "string_data.isnull()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:51.605133Z", "start_time": "2019-12-24T17:42:51.438477Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NA handling methods\n", "\n" ] }, { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Argument</th>\n", " <th>Description</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>1</th>\n", " <td>dropna</td>\n", " <td>Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>fillna</td>\n", " <td>Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>isnull</td>\n", " <td>Return boolean values indicating which values are missing/NA.</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>notnull</td>\n", " <td>Negation of isnull.</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Argument \\\n", "1 dropna \n", "2 fillna \n", "3 isnull \n", "4 notnull \n", "\n", " Description \n", "1 Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate. \n", "2 Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'. \n", "3 Return boolean values indicating which values are missing/NA. \n", "4 Negation of isnull. " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.options.display.max_colwidth = 500\n", "df = pd.read_csv(r'examples/NA_handling_methods.csv', names=['Description'], sep='\\n')\n", "df.index = list(range(1, 5))\n", "\n", "def f1(x):\n", " x = x.split(' ')\n", " return x.pop(0)\n", "\n", "\n", "df['Argument'] = df['Description'].map(f1) # get the first word\n", "\n", "def f2(x):\n", " x = x.split(' ')\n", " return \" \".join(x[1:])\n", "\n", "df['Description'] = df['Description'].map(f2) # remove the first word\n", "\n", "df = df.reindex(columns=['Argument', 'Description'])\n", "print(\"NA handling methods\")\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filtering Out Missing Data\n", "\n", "There are a few ways to filter out missing data. While you always have the option to do it by hand using **pandas.isnull** and **boolean indexing**, the **dropna** can be helpful." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:51.746687Z", "start_time": "2019-12-24T17:42:51.607016Z" } }, "outputs": [], "source": [ "# On a Series, it returns the Series with only the non-null data and index values\n", "\n", "from numpy import nan as NA" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:51.894072Z", "start_time": "2019-12-24T17:42:51.750247Z" } }, "outputs": [], "source": [ "data = pd.Series([1, NA, 3.5, NA, 7])" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:52.036025Z", "start_time": "2019-12-24T17:42:51.896067Z" } }, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "2 3.5\n", "4 7.0\n", "dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.dropna()\n", "# data[data.notnull()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With DataFrame objects, things are a bit more complex. You may want to drop rows or columns that are all NA or only those containing any NAs. dropna by **default drops any row containing a missing value**" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:52.179002Z", "start_time": "2019-12-24T17:42:52.037987Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1.0</td>\n", " <td>6.5</td>\n", " <td>3.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1.0</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>NaN</td>\n", " <td>6.5</td>\n", " <td>3.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2\n", "0 1.0 6.5 3.0\n", "1 1.0 NaN NaN\n", "2 NaN NaN NaN\n", "3 NaN 6.5 3.0" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],\n", " [NA, NA, NA], [NA, 6.5, 3.]])\n", "data" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:52.326562Z", "start_time": "2019-12-24T17:42:52.179002Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1.0</td>\n", " <td>6.5</td>\n", " <td>3.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2\n", "0 1.0 6.5 3.0" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cleaned = data.dropna()\n", "cleaned" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:52.492974Z", "start_time": "2019-12-24T17:42:52.328730Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1.0</td>\n", " <td>6.5</td>\n", " <td>3.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1.0</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>NaN</td>\n", " <td>6.5</td>\n", " <td>3.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2\n", "0 1.0 6.5 3.0\n", "1 1.0 NaN NaN\n", "3 NaN 6.5 3.0" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Passing how='all' will only drop rows that are all NA\n", "\n", "data.dropna(how='all')\n", "# data.dropna(how='any')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:52.663286Z", "start_time": "2019-12-24T17:42:52.493950Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " <th>4</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1.0</td>\n", " <td>6.5</td>\n", " <td>3.0</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1.0</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>NaN</td>\n", " <td>6.5</td>\n", " <td>3.0</td>\n", " <td>NaN</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2 4\n", "0 1.0 6.5 3.0 NaN\n", "1 1.0 NaN NaN NaN\n", "2 NaN NaN NaN NaN\n", "3 NaN 6.5 3.0 NaN" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# To drop columns in the same way, pass axis=1\n", "\n", "data[4] = NA\n", "data" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:52.828878Z", "start_time": "2019-12-24T17:42:52.665142Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1.0</td>\n", " <td>6.5</td>\n", " <td>3.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1.0</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>NaN</td>\n", " <td>6.5</td>\n", " <td>3.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2\n", "0 1.0 6.5 3.0\n", "1 1.0 NaN NaN\n", "2 NaN NaN NaN\n", "3 NaN 6.5 3.0" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.dropna(axis=1, how='all')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A related way to filter out DataFrame rows tends to concern time series data. **Suppose you want to keep only rows containing a certain number of observations**. You can indicate this with the **thresh** argument" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:52.964051Z", "start_time": "2019-12-24T17:42:52.834105Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1.196168</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>-1.007727</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2.909089</td>\n", " <td>NaN</td>\n", " <td>-2.142449</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>-0.866425</td>\n", " <td>NaN</td>\n", " <td>1.881426</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>0.300446</td>\n", " <td>-1.266830</td>\n", " <td>0.296984</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>-0.963050</td>\n", " <td>1.649808</td>\n", " <td>-1.686630</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>-0.087371</td>\n", " <td>-0.342160</td>\n", " <td>-0.608004</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2\n", "0 1.196168 NaN NaN\n", "1 -1.007727 NaN NaN\n", "2 2.909089 NaN -2.142449\n", "3 -0.866425 NaN 1.881426\n", "4 0.300446 -1.266830 0.296984\n", "5 -0.963050 1.649808 -1.686630\n", "6 -0.087371 -0.342160 -0.608004" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(np.random.randn(7, 3))\n", "df.iloc[:4, 1] = NA\n", "df.iloc[:2, 2] = NA\n", "df" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:53.139880Z", "start_time": "2019-12-24T17:42:52.967617Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0 1 2\n", "4 0.300446 -1.266830 0.296984\n", "5 -0.963050 1.649808 -1.686630\n", "6 -0.087371 -0.342160 -0.608004\n", "\n", " 0 1 2\n", "2 2.909089 NaN -2.142449\n", "3 -0.866425 NaN 1.881426\n", "4 0.300446 -1.266830 0.296984\n", "5 -0.963050 1.649808 -1.686630\n", "6 -0.087371 -0.342160 -0.608004\n", "\n" ] } ], "source": [ "print(df.dropna())\n", "print(df.dropna(thresh=2))" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:53.321362Z", "start_time": "2019-12-24T17:42:53.141874Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1.196168</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>-1.007727</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2.909089</td>\n", " <td>NaN</td>\n", " <td>-2.142449</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>-0.866425</td>\n", " <td>NaN</td>\n", " <td>1.881426</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>0.300446</td>\n", " <td>-1.266830</td>\n", " <td>0.296984</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>-0.963050</td>\n", " <td>1.649808</td>\n", " <td>-1.686630</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>-0.087371</td>\n", " <td>-0.342160</td>\n", " <td>-0.608004</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2\n", "0 1.196168 NaN NaN\n", "1 -1.007727 NaN NaN\n", "2 2.909089 NaN -2.142449\n", "3 -0.866425 NaN 1.881426\n", "4 0.300446 -1.266830 0.296984\n", "5 -0.963050 1.649808 -1.686630\n", "6 -0.087371 -0.342160 -0.608004" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:53.477978Z", "start_time": "2019-12-24T17:42:53.324143Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(3, 3)\n", "\n", "(5, 3)\n", "\n" ] } ], "source": [ "print(df.dropna(subset=[1, 2], how='any').shape)\n", "print(df.dropna(subset=[1, 2], how='all').shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filling In Missing Data" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:53.630300Z", "start_time": "2019-12-24T17:42:53.480393Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1.196168</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>-1.007727</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2.909089</td>\n", " <td>0.000000</td>\n", " <td>-2.142449</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>-0.866425</td>\n", " <td>0.000000</td>\n", " <td>1.881426</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>0.300446</td>\n", " <td>-1.266830</td>\n", " <td>0.296984</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>-0.963050</td>\n", " <td>1.649808</td>\n", " <td>-1.686630</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>-0.087371</td>\n", " <td>-0.342160</td>\n", " <td>-0.608004</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2\n", "0 1.196168 0.000000 0.000000\n", "1 -1.007727 0.000000 0.000000\n", "2 2.909089 0.000000 -2.142449\n", "3 -0.866425 0.000000 1.881426\n", "4 0.300446 -1.266830 0.296984\n", "5 -0.963050 1.649808 -1.686630\n", "6 -0.087371 -0.342160 -0.608004" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna(0)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:53.795600Z", "start_time": "2019-12-24T17:42:53.632354Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1.196168</td>\n", " <td>0.500000</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>-1.007727</td>\n", " <td>0.500000</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2.909089</td>\n", " <td>0.500000</td>\n", " <td>-2.142449</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>-0.866425</td>\n", " <td>0.500000</td>\n", " <td>1.881426</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>0.300446</td>\n", " <td>-1.266830</td>\n", " <td>0.296984</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>-0.963050</td>\n", " <td>1.649808</td>\n", " <td>-1.686630</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>-0.087371</td>\n", " <td>-0.342160</td>\n", " <td>-0.608004</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2\n", "0 1.196168 0.500000 0.000000\n", "1 -1.007727 0.500000 0.000000\n", "2 2.909089 0.500000 -2.142449\n", "3 -0.866425 0.500000 1.881426\n", "4 0.300446 -1.266830 0.296984\n", "5 -0.963050 1.649808 -1.686630\n", "6 -0.087371 -0.342160 -0.608004" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Calling fillna with a dict, you can use a different fill value for each column\n", "\n", "df.fillna({1: 0.5, 2: 0})" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:53.971110Z", "start_time": "2019-12-24T17:42:53.798627Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1.196168</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>-1.007727</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2.909089</td>\n", " <td>0.000000</td>\n", " <td>-2.142449</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>-0.866425</td>\n", " <td>0.000000</td>\n", " <td>1.881426</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>0.300446</td>\n", " <td>-1.266830</td>\n", " <td>0.296984</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>-0.963050</td>\n", " <td>1.649808</td>\n", " <td>-1.686630</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>-0.087371</td>\n", " <td>-0.342160</td>\n", " <td>-0.608004</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2\n", "0 1.196168 0.000000 0.000000\n", "1 -1.007727 0.000000 0.000000\n", "2 2.909089 0.000000 -2.142449\n", "3 -0.866425 0.000000 1.881426\n", "4 0.300446 -1.266830 0.296984\n", "5 -0.963050 1.649808 -1.686630\n", "6 -0.087371 -0.342160 -0.608004" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# fillna returns a new object, but you can modify the existing object in-place\n", "_ = df.fillna(0, inplace=True)\n", "df" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:54.119327Z", "start_time": "2019-12-24T17:42:53.972107Z" } }, "outputs": [], "source": [ "df = pd.DataFrame(np.random.randn(6, 3))" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:54.585826Z", "start_time": "2019-12-24T17:42:54.119327Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>-1.283063</td>\n", " <td>1.456059</td>\n", " <td>0.639081</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>-0.863694</td>\n", " <td>-0.012437</td>\n", " <td>-0.741003</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0.533858</td>\n", " <td>NaN</td>\n", " <td>-0.741355</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>1.690786</td>\n", " <td>NaN</td>\n", " <td>-0.396711</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>-0.296663</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>-1.422233</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2\n", "0 -1.283063 1.456059 0.639081\n", "1 -0.863694 -0.012437 -0.741003\n", "2 0.533858 NaN -0.741355\n", "3 1.690786 NaN -0.396711\n", "4 -0.296663 NaN NaN\n", "5 -1.422233 NaN NaN" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[2:, 1] = NA\n", "df.iloc[4:, 2] = NA\n", "df" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:54.843584Z", "start_time": "2019-12-24T17:42:54.586856Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>-1.283063</td>\n", " <td>1.456059</td>\n", " <td>0.639081</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>-0.863694</td>\n", " <td>-0.012437</td>\n", " <td>-0.741003</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0.533858</td>\n", " <td>-0.012437</td>\n", " <td>-0.741355</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>1.690786</td>\n", " <td>-0.012437</td>\n", " <td>-0.396711</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>-0.296663</td>\n", " <td>-0.012437</td>\n", " <td>-0.396711</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>-1.422233</td>\n", " <td>-0.012437</td>\n", " <td>-0.396711</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2\n", "0 -1.283063 1.456059 0.639081\n", "1 -0.863694 -0.012437 -0.741003\n", "2 0.533858 -0.012437 -0.741355\n", "3 1.690786 -0.012437 -0.396711\n", "4 -0.296663 -0.012437 -0.396711\n", "5 -1.422233 -0.012437 -0.396711" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna(method='ffill')" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:54.997314Z", "start_time": "2019-12-24T17:42:54.846361Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>-1.283063</td>\n", " <td>1.456059</td>\n", " <td>0.639081</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>-0.863694</td>\n", " <td>-0.012437</td>\n", " <td>-0.741003</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0.533858</td>\n", " <td>-0.012437</td>\n", " <td>-0.741355</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>1.690786</td>\n", " <td>-0.012437</td>\n", " <td>-0.396711</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>-0.296663</td>\n", " <td>NaN</td>\n", " <td>-0.396711</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>-1.422233</td>\n", " <td>NaN</td>\n", " <td>-0.396711</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2\n", "0 -1.283063 1.456059 0.639081\n", "1 -0.863694 -0.012437 -0.741003\n", "2 0.533858 -0.012437 -0.741355\n", "3 1.690786 -0.012437 -0.396711\n", "4 -0.296663 NaN -0.396711\n", "5 -1.422233 NaN -0.396711" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna(method='ffill', limit=2)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:55.162941Z", "start_time": "2019-12-24T17:42:54.998307Z" } }, "outputs": [], "source": [ "# we might pass mean or median of a series\n", "\n", "data = pd.Series([1., NA, 3.5, NA, 7])" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:55.325659Z", "start_time": "2019-12-24T17:42:55.165933Z" } }, "outputs": [ { "data": { "text/plain": [ "0 1.000000\n", "1 3.833333\n", "2 3.500000\n", "3 3.833333\n", "4 7.000000\n", "dtype: float64" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.fillna(data.mean())" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-12-23T17:37:04.279897Z", "start_time": "2019-12-23T17:37:04.272219Z" } }, "source": [ "**Arguments** - **Description** for fillna\n", "\n", "**value** - Scalar value or dict-like object to use to fill missing values\n", "\n", "**method** - Interpolation; by default 'ffill' if function called with no other arguments\n", "\n", "**axis** - Axis to fill on; default axis=0\n", "\n", "**inplace** - Modify the calling object without producing a copy\n", "\n", "**limit** - For forward and backward filling, maximum number of consecutive periods to fill" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Transformation\n", "\n", "### Removing Duplicates" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:55.494690Z", "start_time": "2019-12-24T17:42:55.328594Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>k1</th>\n", " <th>k2</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>one</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>two</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>one</td>\n", " <td>2</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>two</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>one</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>two</td>\n", " <td>4</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>two</td>\n", " <td>4</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " k1 k2\n", "0 one 1\n", "1 two 1\n", "2 one 2\n", "3 two 3\n", "4 one 3\n", "5 two 4\n", "6 two 4" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],\n", " 'k2': [1, 1, 2, 3, 3, 4, 4]})\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The DataFrame method **duplicated** returns a boolean Series indicating whether each row is a duplicate (has been observed in a previous row) or not." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:55.649465Z", "start_time": "2019-12-24T17:42:55.498032Z" } }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", "5 False\n", "6 True\n", "dtype: bool" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.duplicated()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:55.804370Z", "start_time": "2019-12-24T17:42:55.652247Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>k1</th>\n", " <th>k2</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>one</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>two</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>one</td>\n", " <td>2</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>two</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>one</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>two</td>\n", " <td>4</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " k1 k2\n", "0 one 1\n", "1 two 1\n", "2 one 2\n", "3 two 3\n", "4 one 3\n", "5 two 4" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# drop_duplicates returns a DataFrame where the duplicated array is False\n", "\n", "data.drop_duplicates()" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:55.962564Z", "start_time": "2019-12-24T17:42:55.805368Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>k1</th>\n", " <th>k2</th>\n", " <th>v1</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>one</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>two</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>one</td>\n", " <td>2</td>\n", " <td>2</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>two</td>\n", " <td>3</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>one</td>\n", " <td>3</td>\n", " <td>4</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>two</td>\n", " <td>4</td>\n", " <td>5</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>two</td>\n", " <td>4</td>\n", " <td>6</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " k1 k2 v1\n", "0 one 1 0\n", "1 two 1 1\n", "2 one 2 2\n", "3 two 3 3\n", "4 one 3 4\n", "5 two 4 5\n", "6 two 4 6" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Suppose we had an additional column of values and wanted to \n", "# filter duplicates only based on the 'k1' column\n", "\n", "data['v1'] = range(7)\n", "data" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:56.114393Z", "start_time": "2019-12-24T17:42:55.964622Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>k1</th>\n", " <th>k2</th>\n", " <th>v1</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>one</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>two</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " k1 k2 v1\n", "0 one 1 0\n", "1 two 1 1" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.drop_duplicates(['k1'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Transforming Data Using a Function or Mapping\n", "\n", "For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame." ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:56.277029Z", "start_time": "2019-12-24T17:42:56.116562Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>food</th>\n", " <th>ounces</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>bacon</td>\n", " <td>4.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>pulled pork</td>\n", " <td>3.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>bacon</td>\n", " <td>12.0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>Pastrami</td>\n", " <td>6.0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>corned beef</td>\n", " <td>7.5</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>Bacon</td>\n", " <td>8.0</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>pastrami</td>\n", " <td>3.0</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>honey ham</td>\n", " <td>5.0</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>nova lox</td>\n", " <td>6.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " food ounces\n", "0 bacon 4.0\n", "1 pulled pork 3.0\n", "2 bacon 12.0\n", "3 Pastrami 6.0\n", "4 corned beef 7.5\n", "5 Bacon 8.0\n", "6 pastrami 3.0\n", "7 honey ham 5.0\n", "8 nova lox 6.0" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',\n", " 'Pastrami', 'corned beef', 'Bacon',\n", " 'pastrami', 'honey ham', 'nova lox'],\n", " 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})\n", "data" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:56.417657Z", "start_time": "2019-12-24T17:42:56.278980Z" } }, "outputs": [], "source": [ "# Suppose you wanted to add a column indicating the \n", "# type of animal that each food came from\n", "\n", "meat_to_animal = {\n", " 'bacon': 'pig',\n", " 'pulled pork': 'pig',\n", " 'pastrami': 'cow',\n", " 'corned beef': 'cow',\n", " 'honey ham': 'pig',\n", " 'nova lox': 'salmon'\n", "}" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:56.595457Z", "start_time": "2019-12-24T17:42:56.419812Z" } }, "outputs": [ { "data": { "text/plain": [ "0 bacon\n", "1 pulled pork\n", "2 bacon\n", "3 pastrami\n", "4 corned beef\n", "5 bacon\n", "6 pastrami\n", "7 honey ham\n", "8 nova lox\n", "Name: food, dtype: object" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lowercased = data['food'].str.lower()\n", "lowercased" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:56.767759Z", "start_time": "2019-12-24T17:42:56.597643Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>food</th>\n", " <th>ounces</th>\n", " <th>animal</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>bacon</td>\n", " <td>4.0</td>\n", " <td>pig</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>pulled pork</td>\n", " <td>3.0</td>\n", " <td>pig</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>bacon</td>\n", " <td>12.0</td>\n", " <td>pig</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>Pastrami</td>\n", " <td>6.0</td>\n", " <td>cow</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>corned beef</td>\n", " <td>7.5</td>\n", " <td>cow</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>Bacon</td>\n", " <td>8.0</td>\n", " <td>pig</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>pastrami</td>\n", " <td>3.0</td>\n", " <td>cow</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>honey ham</td>\n", " <td>5.0</td>\n", " <td>pig</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>nova lox</td>\n", " <td>6.0</td>\n", " <td>salmon</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " food ounces animal\n", "0 bacon 4.0 pig\n", "1 pulled pork 3.0 pig\n", "2 bacon 12.0 pig\n", "3 Pastrami 6.0 cow\n", "4 corned beef 7.5 cow\n", "5 Bacon 8.0 pig\n", "6 pastrami 3.0 cow\n", "7 honey ham 5.0 pig\n", "8 nova lox 6.0 salmon" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['animal'] = lowercased.map(meat_to_animal)\n", "# or\n", "# data['food'].map(lambda x: meat_to_animal[x.lower()])\n", "data" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-12-23T17:56:05.001259Z", "start_time": "2019-12-23T17:56:04.992428Z" } }, "source": [ "### Replacing Values\n", "\n", "Filling in missing data with the fillna method is a special case of more general value replacement. As you’ve already seen, map can be used to modify a subset of values in an object but **replace** provides a simpler and more flexible way to do so" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:56.900004Z", "start_time": "2019-12-24T17:42:56.768790Z" } }, "outputs": [], "source": [ "data = pd.Series([1., -999., 2., -999., -1000., 3.])" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:57.091197Z", "start_time": "2019-12-24T17:42:56.900004Z" } }, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 -999.0\n", "2 2.0\n", "3 -999.0\n", "4 -1000.0\n", "5 3.0\n", "dtype: float64" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:57.234609Z", "start_time": "2019-12-24T17:42:57.092227Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 1.0\n", "1 NaN\n", "2 2.0\n", "3 NaN\n", "4 -1000.0\n", "5 3.0\n", "dtype: float64\n", "\n", "0 1.0\n", "1 NaN\n", "2 2.0\n", "3 NaN\n", "4 NaN\n", "5 3.0\n", "dtype: float64\n", "\n", "0 1.0\n", "1 NaN\n", "2 2.0\n", "3 NaN\n", "4 0.0\n", "5 3.0\n", "dtype: float64\n", "\n", "None\n", "\n" ] } ], "source": [ "print(data.replace(-999, np.nan))\n", "print(data.replace([-999, -1000], np.nan))\n", "print(data.replace([-999, -1000], [np.nan, 0]))\n", "print(data.replace({-999: np.nan, -1000: 0}, inplace=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Renaming Axis Indexes\n", "\n", "axis labels can be transformed by a function or mapping of some form to produce new, differently labeled objects. " ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:57.347390Z", "start_time": "2019-12-24T17:42:57.234609Z" } }, "outputs": [], "source": [ "data = pd.DataFrame(np.arange(12).reshape((3, 4)),\n", " index=['Ohio', 'Colorado', 'New York'],\n", " columns=['one', 'two', 'three', 'four'])" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:57.486408Z", "start_time": "2019-12-24T17:42:57.349987Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['OHIO', 'COLO', 'NEW '], dtype='object')\n", "\n", "Index(['OH', 'CO', 'NE'], dtype='object')\n", "\n" ] } ], "source": [ "transform = lambda x: x[:4].upper()\n", "\n", "# Like a Series, the axis indexes have a map method\n", "\n", "print(data.index.map(transform))\n", "\n", "# -----------\n", "trans = lambda x: x[:2].upper()\n", "print(data.index.map(trans))" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:57.630459Z", "start_time": "2019-12-24T17:42:57.488063Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>one</th>\n", " <th>two</th>\n", " <th>three</th>\n", " <th>four</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>OHIO</th>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>COLO</th>\n", " <td>4</td>\n", " <td>5</td>\n", " <td>6</td>\n", " <td>7</td>\n", " </tr>\n", " <tr>\n", " <th>NEW</th>\n", " <td>8</td>\n", " <td>9</td>\n", " <td>10</td>\n", " <td>11</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " one two three four\n", "OHIO 0 1 2 3\n", "COLO 4 5 6 7\n", "NEW 8 9 10 11" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.index = data.index.map(transform)\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to create a transformed version of a dataset without modifying the original, a useful method is **rename**" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:57.779949Z", "start_time": "2019-12-24T17:42:57.631489Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>ONE</th>\n", " <th>TWO</th>\n", " <th>THREE</th>\n", " <th>FOUR</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Ohio</th>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>Colo</th>\n", " <td>4</td>\n", " <td>5</td>\n", " <td>6</td>\n", " <td>7</td>\n", " </tr>\n", " <tr>\n", " <th>New</th>\n", " <td>8</td>\n", " <td>9</td>\n", " <td>10</td>\n", " <td>11</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " ONE TWO THREE FOUR\n", "Ohio 0 1 2 3\n", "Colo 4 5 6 7\n", "New 8 9 10 11" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.rename(index=str.title, columns=str.upper)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:57.934232Z", "start_time": "2019-12-24T17:42:57.782092Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>one</th>\n", " <th>two</th>\n", " <th>peekaboo</th>\n", " <th>four</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>INDIANA</th>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>COLO</th>\n", " <td>4</td>\n", " <td>5</td>\n", " <td>6</td>\n", " <td>7</td>\n", " </tr>\n", " <tr>\n", " <th>NEW</th>\n", " <td>8</td>\n", " <td>9</td>\n", " <td>10</td>\n", " <td>11</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " one two peekaboo four\n", "INDIANA 0 1 2 3\n", "COLO 4 5 6 7\n", "NEW 8 9 10 11" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# rename can be used in conjunction with a dict-like object\n", "# providing new values for a subset of the axis labels\n", "\n", "data.rename(index={'OHIO': 'INDIANA'},\n", " columns={'three': 'peekaboo'}, inplace=True)\n", "data" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:58.052890Z", "start_time": "2019-12-24T17:42:57.934232Z" } }, "outputs": [], "source": [ "# ToD: Discretization and Binning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Detecting and Filtering Outliers\n", "\n", "Filtering or transforming outliers is largely a matter of applying array operations." ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:58.224509Z", "start_time": "2019-12-24T17:42:58.052890Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " <th>3</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>count</th>\n", " <td>1000.000000</td>\n", " <td>1000.000000</td>\n", " <td>1000.000000</td>\n", " <td>1000.000000</td>\n", " </tr>\n", " <tr>\n", " <th>mean</th>\n", " <td>0.015383</td>\n", " <td>0.024896</td>\n", " <td>0.038430</td>\n", " <td>-0.064249</td>\n", " </tr>\n", " <tr>\n", " <th>std</th>\n", " <td>1.008377</td>\n", " <td>0.996469</td>\n", " <td>0.989867</td>\n", " <td>0.995204</td>\n", " </tr>\n", " <tr>\n", " <th>min</th>\n", " <td>-3.141061</td>\n", " <td>-3.430585</td>\n", " <td>-2.886733</td>\n", " <td>-3.260575</td>\n", " </tr>\n", " <tr>\n", " <th>25%</th>\n", " <td>-0.708238</td>\n", " <td>-0.597616</td>\n", " <td>-0.656763</td>\n", " <td>-0.766798</td>\n", " </tr>\n", " <tr>\n", " <th>50%</th>\n", " <td>0.051303</td>\n", " <td>0.051438</td>\n", " <td>0.048483</td>\n", " <td>-0.071339</td>\n", " </tr>\n", " <tr>\n", " <th>75%</th>\n", " <td>0.730775</td>\n", " <td>0.691972</td>\n", " <td>0.666596</td>\n", " <td>0.639310</td>\n", " </tr>\n", " <tr>\n", " <th>max</th>\n", " <td>3.766905</td>\n", " <td>3.349857</td>\n", " <td>3.361674</td>\n", " <td>2.854713</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2 3\n", "count 1000.000000 1000.000000 1000.000000 1000.000000\n", "mean 0.015383 0.024896 0.038430 -0.064249\n", "std 1.008377 0.996469 0.989867 0.995204\n", "min -3.141061 -3.430585 -2.886733 -3.260575\n", "25% -0.708238 -0.597616 -0.656763 -0.766798\n", "50% 0.051303 0.051438 0.048483 -0.071339\n", "75% 0.730775 0.691972 0.666596 0.639310\n", "max 3.766905 3.349857 3.361674 2.854713" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.DataFrame(np.random.randn(1000, 4))\n", "data.describe()" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:58.337732Z", "start_time": "2019-12-24T17:42:58.225600Z" } }, "outputs": [], "source": [ "col = data[0]" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:58.470172Z", "start_time": "2019-12-24T17:42:58.345580Z" } }, "outputs": [ { "data": { "text/plain": [ "1 -3.141061\n", "56 3.766905\n", "247 3.008167\n", "787 -3.135201\n", "Name: 0, dtype: float64" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "col[np.abs(col) > 3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To select all rows having a value exceeding 3 or –3, you can use the any method on a boolean DataFrame" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:58.634225Z", "start_time": "2019-12-24T17:42:58.473682Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " <th>3</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>1</th>\n", " <td>-3.141061</td>\n", " <td>0.783127</td>\n", " <td>0.469851</td>\n", " <td>-0.395636</td>\n", " </tr>\n", " <tr>\n", " <th>56</th>\n", " <td>3.766905</td>\n", " <td>0.030909</td>\n", " <td>1.633494</td>\n", " <td>1.140207</td>\n", " </tr>\n", " <tr>\n", " <th>247</th>\n", " <td>3.008167</td>\n", " <td>-0.992888</td>\n", " <td>0.260335</td>\n", " <td>-3.063802</td>\n", " </tr>\n", " <tr>\n", " <th>327</th>\n", " <td>-0.074563</td>\n", " <td>3.349857</td>\n", " <td>2.032817</td>\n", " <td>-0.609838</td>\n", " </tr>\n", " <tr>\n", " <th>405</th>\n", " <td>-0.362679</td>\n", " <td>-0.103159</td>\n", " <td>3.087357</td>\n", " <td>2.517252</td>\n", " </tr>\n", " <tr>\n", " <th>448</th>\n", " <td>-0.437464</td>\n", " <td>3.054526</td>\n", " <td>-0.254406</td>\n", " <td>1.047457</td>\n", " </tr>\n", " <tr>\n", " <th>611</th>\n", " <td>-0.369185</td>\n", " <td>-3.430585</td>\n", " <td>-0.515431</td>\n", " <td>-1.668876</td>\n", " </tr>\n", " <tr>\n", " <th>619</th>\n", " <td>-0.311282</td>\n", " <td>0.535545</td>\n", " <td>3.039966</td>\n", " <td>-1.199799</td>\n", " </tr>\n", " <tr>\n", " <th>631</th>\n", " <td>-0.923738</td>\n", " <td>0.346249</td>\n", " <td>-0.485636</td>\n", " <td>-3.260575</td>\n", " </tr>\n", " <tr>\n", " <th>765</th>\n", " <td>-0.856659</td>\n", " <td>-0.717638</td>\n", " <td>-1.155797</td>\n", " <td>-3.069395</td>\n", " </tr>\n", " <tr>\n", " <th>787</th>\n", " <td>-3.135201</td>\n", " <td>-1.338375</td>\n", " <td>-0.404171</td>\n", " <td>1.373373</td>\n", " </tr>\n", " <tr>\n", " <th>917</th>\n", " <td>0.437193</td>\n", " <td>-3.026429</td>\n", " <td>-0.438433</td>\n", " <td>0.024585</td>\n", " </tr>\n", " <tr>\n", " <th>984</th>\n", " <td>-0.629788</td>\n", " <td>0.308067</td>\n", " <td>3.361674</td>\n", " <td>-1.683448</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2 3\n", "1 -3.141061 0.783127 0.469851 -0.395636\n", "56 3.766905 0.030909 1.633494 1.140207\n", "247 3.008167 -0.992888 0.260335 -3.063802\n", "327 -0.074563 3.349857 2.032817 -0.609838\n", "405 -0.362679 -0.103159 3.087357 2.517252\n", "448 -0.437464 3.054526 -0.254406 1.047457\n", "611 -0.369185 -3.430585 -0.515431 -1.668876\n", "619 -0.311282 0.535545 3.039966 -1.199799\n", "631 -0.923738 0.346249 -0.485636 -3.260575\n", "765 -0.856659 -0.717638 -1.155797 -3.069395\n", "787 -3.135201 -1.338375 -0.404171 1.373373\n", "917 0.437193 -3.026429 -0.438433 0.024585\n", "984 -0.629788 0.308067 3.361674 -1.683448" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[(np.abs(data) > 3).any(axis=1)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Values can be set based on these criteria. Here is code to cap values outside the interval –3 to 3" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:58.768926Z", "start_time": "2019-12-24T17:42:58.635223Z" } }, "outputs": [], "source": [ "data[np.abs(data) > 3] = np.sign(data) * 3" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:58.935675Z", "start_time": "2019-12-24T17:42:58.768926Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " <th>3</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>count</th>\n", " <td>1000.000000</td>\n", " <td>1000.000000</td>\n", " <td>1000.000000</td>\n", " <td>1000.000000</td>\n", " </tr>\n", " <tr>\n", " <th>mean</th>\n", " <td>0.014885</td>\n", " <td>0.024949</td>\n", " <td>0.037941</td>\n", " <td>-0.063855</td>\n", " </tr>\n", " <tr>\n", " <th>std</th>\n", " <td>1.004937</td>\n", " <td>0.993711</td>\n", " <td>0.988330</td>\n", " <td>0.994002</td>\n", " </tr>\n", " <tr>\n", " <th>min</th>\n", " <td>-3.000000</td>\n", " <td>-3.000000</td>\n", " <td>-2.886733</td>\n", " <td>-3.000000</td>\n", " </tr>\n", " <tr>\n", " <th>25%</th>\n", " <td>-0.708238</td>\n", " <td>-0.597616</td>\n", " <td>-0.656763</td>\n", " <td>-0.766798</td>\n", " </tr>\n", " <tr>\n", " <th>50%</th>\n", " <td>0.051303</td>\n", " <td>0.051438</td>\n", " <td>0.048483</td>\n", " <td>-0.071339</td>\n", " </tr>\n", " <tr>\n", " <th>75%</th>\n", " <td>0.730775</td>\n", " <td>0.691972</td>\n", " <td>0.666596</td>\n", " <td>0.639310</td>\n", " </tr>\n", " <tr>\n", " <th>max</th>\n", " <td>3.000000</td>\n", " <td>3.000000</td>\n", " <td>3.000000</td>\n", " <td>2.854713</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2 3\n", "count 1000.000000 1000.000000 1000.000000 1000.000000\n", "mean 0.014885 0.024949 0.037941 -0.063855\n", "std 1.004937 0.993711 0.988330 0.994002\n", "min -3.000000 -3.000000 -2.886733 -3.000000\n", "25% -0.708238 -0.597616 -0.656763 -0.766798\n", "50% 0.051303 0.051438 0.048483 -0.071339\n", "75% 0.730775 0.691972 0.666596 0.639310\n", "max 3.000000 3.000000 3.000000 2.854713" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.describe()" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:59.066787Z", "start_time": "2019-12-24T17:42:58.936677Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " <th>3</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>-1.0</td>\n", " <td>-1.0</td>\n", " <td>-1.0</td>\n", " <td>-1.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>-1.0</td>\n", " <td>1.0</td>\n", " <td>1.0</td>\n", " <td>-1.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>1.0</td>\n", " <td>-1.0</td>\n", " <td>1.0</td>\n", " <td>-1.0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>1.0</td>\n", " <td>-1.0</td>\n", " <td>1.0</td>\n", " <td>-1.0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>1.0</td>\n", " <td>-1.0</td>\n", " <td>-1.0</td>\n", " <td>1.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2 3\n", "0 -1.0 -1.0 -1.0 -1.0\n", "1 -1.0 1.0 1.0 -1.0\n", "2 1.0 -1.0 1.0 -1.0\n", "3 1.0 -1.0 1.0 -1.0\n", "4 1.0 -1.0 -1.0 1.0" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The statement np.sign(data) produces 1 and –1 values based on \n", "# whether the values in data are positive or negative\n", "\n", "np.sign(data).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Permutation and Random Sampling\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the numpy.random.permutation function. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:59.221723Z", "start_time": "2019-12-24T17:42:59.067821Z" } }, "outputs": [], "source": [ "df = pd.DataFrame(np.arange(20).reshape((5, 4)))" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:59.370484Z", "start_time": "2019-12-24T17:42:59.222721Z" } }, "outputs": [ { "data": { "text/plain": [ "array([3, 4, 1, 2, 0])" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sampler = np.random.permutation(5)\n", "sampler" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That array can then be used in iloc-based indexing or the equivalent **take** function" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:59.559900Z", "start_time": "2019-12-24T17:42:59.374481Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " <th>3</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>4</td>\n", " <td>5</td>\n", " <td>6</td>\n", " <td>7</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>8</td>\n", " <td>9</td>\n", " <td>10</td>\n", " <td>11</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>12</td>\n", " <td>13</td>\n", " <td>14</td>\n", " <td>15</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>16</td>\n", " <td>17</td>\n", " <td>18</td>\n", " <td>19</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2 3\n", "0 0 1 2 3\n", "1 4 5 6 7\n", "2 8 9 10 11\n", "3 12 13 14 15\n", "4 16 17 18 19" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:59.714703Z", "start_time": "2019-12-24T17:42:59.562894Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " <th>3</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>3</th>\n", " <td>12</td>\n", " <td>13</td>\n", " <td>14</td>\n", " <td>15</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>16</td>\n", " <td>17</td>\n", " <td>18</td>\n", " <td>19</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>4</td>\n", " <td>5</td>\n", " <td>6</td>\n", " <td>7</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>8</td>\n", " <td>9</td>\n", " <td>10</td>\n", " <td>11</td>\n", " </tr>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>3</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2 3\n", "3 12 13 14 15\n", "4 16 17 18 19\n", "1 4 5 6 7\n", "2 8 9 10 11\n", "0 0 1 2 3" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.take(sampler)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To select a random subset **without replacement**, you can use the **sample** method on Series and DataFrame" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:59.877470Z", "start_time": "2019-12-24T17:42:59.717464Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " <th>3</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>4</th>\n", " <td>16</td>\n", " <td>17</td>\n", " <td>18</td>\n", " <td>19</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>8</td>\n", " <td>9</td>\n", " <td>10</td>\n", " <td>11</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>4</td>\n", " <td>5</td>\n", " <td>6</td>\n", " <td>7</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2 3\n", "4 16 17 18 19\n", "2 8 9 10 11\n", "1 4 5 6 7" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sample(n=3)" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:00.022827Z", "start_time": "2019-12-24T17:42:59.880463Z" } }, "outputs": [ { "data": { "text/plain": [ "3 6\n", "1 7\n", "3 6\n", "0 5\n", "3 6\n", "0 5\n", "1 7\n", "2 -1\n", "2 -1\n", "4 4\n", "dtype: int64" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# sample with replacement\n", "\n", "choices = pd.Series([5, 7, -1, 6, 4])\n", "draws = choices.sample(n=10, replace=True)\n", "draws" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Computing Indicator/Dummy Variables\n" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:01.237733Z", "start_time": "2019-12-24T17:43:00.025854Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>PassengerId</th>\n", " <th>Survived</th>\n", " <th>Pclass</th>\n", " <th>Name</th>\n", " <th>Sex</th>\n", " <th>Age</th>\n", " <th>SibSp</th>\n", " <th>Parch</th>\n", " <th>Ticket</th>\n", " <th>Fare</th>\n", " <th>Cabin</th>\n", " <th>Embarked</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>3</td>\n", " <td>Braund, Mr. Owen Harris</td>\n", " <td>male</td>\n", " <td>22.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>A/5 21171</td>\n", " <td>7.2500</td>\n", " <td>NaN</td>\n", " <td>S</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>Cumings, Mrs. John Bradley (Florence Briggs Thayer)</td>\n", " <td>female</td>\n", " <td>38.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>PC 17599</td>\n", " <td>71.2833</td>\n", " <td>C85</td>\n", " <td>C</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " <td>1</td>\n", " <td>3</td>\n", " <td>Heikkinen, Miss. Laina</td>\n", " <td>female</td>\n", " <td>26.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>STON/O2. 3101282</td>\n", " <td>7.9250</td>\n", " <td>NaN</td>\n", " <td>S</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>4</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n", " <td>female</td>\n", " <td>35.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>113803</td>\n", " <td>53.1000</td>\n", " <td>C123</td>\n", " <td>S</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>5</td>\n", " <td>0</td>\n", " <td>3</td>\n", " <td>Allen, Mr. William Henry</td>\n", " <td>male</td>\n", " <td>35.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>373450</td>\n", " <td>8.0500</td>\n", " <td>NaN</td>\n", " <td>S</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S " ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train = pd.read_csv('http://bit.ly/kaggletrain')\n", "train.head()" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:01.266505Z", "start_time": "2019-12-24T17:43:01.239975Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>PassengerId</th>\n", " <th>Survived</th>\n", " <th>Pclass</th>\n", " <th>Name</th>\n", " <th>Sex</th>\n", " <th>Age</th>\n", " <th>SibSp</th>\n", " <th>Parch</th>\n", " <th>Ticket</th>\n", " <th>Fare</th>\n", " <th>Cabin</th>\n", " <th>Embarked</th>\n", " <th>Sex_male</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>3</td>\n", " <td>Braund, Mr. Owen Harris</td>\n", " <td>male</td>\n", " <td>22.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>A/5 21171</td>\n", " <td>7.2500</td>\n", " <td>NaN</td>\n", " <td>S</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>Cumings, Mrs. John Bradley (Florence Briggs Thayer)</td>\n", " <td>female</td>\n", " <td>38.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>PC 17599</td>\n", " <td>71.2833</td>\n", " <td>C85</td>\n", " <td>C</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " <td>1</td>\n", " <td>3</td>\n", " <td>Heikkinen, Miss. Laina</td>\n", " <td>female</td>\n", " <td>26.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>STON/O2. 3101282</td>\n", " <td>7.9250</td>\n", " <td>NaN</td>\n", " <td>S</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>4</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n", " <td>female</td>\n", " <td>35.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>113803</td>\n", " <td>53.1000</td>\n", " <td>C123</td>\n", " <td>S</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>5</td>\n", " <td>0</td>\n", " <td>3</td>\n", " <td>Allen, Mr. William Henry</td>\n", " <td>male</td>\n", " <td>35.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>373450</td>\n", " <td>8.0500</td>\n", " <td>NaN</td>\n", " <td>S</td>\n", " <td>1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked Sex_male \n", "0 0 A/5 21171 7.2500 NaN S 1 \n", "1 0 PC 17599 71.2833 C85 C 0 \n", "2 0 STON/O2. 3101282 7.9250 NaN S 0 \n", "3 0 113803 53.1000 C123 S 0 \n", "4 0 373450 8.0500 NaN S 1 " ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a dummy column for Sex column\n", "\n", "# using map method\n", "train['Sex_male'] = train.Sex.map({'female': 0, 'male': 1})\n", "train.head()" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:01.411456Z", "start_time": "2019-12-24T17:43:01.268780Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>female</th>\n", " <th>male</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>...</th>\n", " <td>...</td>\n", " <td>...</td>\n", " </tr>\n", " <tr>\n", " <th>886</th>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>887</th>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>888</th>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>889</th>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>890</th>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>891 rows × 2 columns</p>\n", "</div>" ], "text/plain": [ " female male\n", "0 0 1\n", "1 1 0\n", "2 1 0\n", "3 1 0\n", "4 0 1\n", ".. ... ...\n", "886 0 1\n", "887 1 0\n", "888 1 0\n", "889 0 1\n", "890 0 1\n", "\n", "[891 rows x 2 columns]" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# using get_dummies\n", " \n", "pd.get_dummies(train.Sex)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generally, **k** categorical values, then use **k-1** dummy variables to represent it.\n", "\n", "for Sex, we have to categorical values so we need only 1 variable to encode all the given information. So we drop the first column." ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:01.573467Z", "start_time": "2019-12-24T17:43:01.414109Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Sex_male</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>...</th>\n", " <td>...</td>\n", " </tr>\n", " <tr>\n", " <th>886</th>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>887</th>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>888</th>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>889</th>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>890</th>\n", " <td>1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>891 rows × 1 columns</p>\n", "</div>" ], "text/plain": [ " Sex_male\n", "0 1\n", "1 0\n", "2 0\n", "3 0\n", "4 1\n", ".. ...\n", "886 1\n", "887 0\n", "888 0\n", "889 1\n", "890 1\n", "\n", "[891 rows x 1 columns]" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.get_dummies(train.Sex, prefix='Sex').iloc[:, 1:]" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:01.712385Z", "start_time": "2019-12-24T17:43:01.577414Z" } }, "outputs": [ { "data": { "text/plain": [ "S 644\n", "C 168\n", "Q 77\n", "Name: Embarked, dtype: int64" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.Embarked.value_counts()" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:01.845962Z", "start_time": "2019-12-24T17:43:01.712385Z" } }, "outputs": [], "source": [ "embarked_dummies = pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:, 1:]" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:02.023567Z", "start_time": "2019-12-24T17:43:01.845962Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>PassengerId</th>\n", " <th>Survived</th>\n", " <th>Pclass</th>\n", " <th>Name</th>\n", " <th>Sex</th>\n", " <th>Age</th>\n", " <th>SibSp</th>\n", " <th>Parch</th>\n", " <th>Ticket</th>\n", " <th>Fare</th>\n", " <th>Cabin</th>\n", " <th>Embarked</th>\n", " <th>Sex_male</th>\n", " <th>Embarked_Q</th>\n", " <th>Embarked_S</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>3</td>\n", " <td>Braund, Mr. Owen Harris</td>\n", " <td>male</td>\n", " <td>22.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>A/5 21171</td>\n", " <td>7.2500</td>\n", " <td>NaN</td>\n", " <td>S</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>Cumings, Mrs. John Bradley (Florence Briggs Thayer)</td>\n", " <td>female</td>\n", " <td>38.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>PC 17599</td>\n", " <td>71.2833</td>\n", " <td>C85</td>\n", " <td>C</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " <td>1</td>\n", " <td>3</td>\n", " <td>Heikkinen, Miss. Laina</td>\n", " <td>female</td>\n", " <td>26.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>STON/O2. 3101282</td>\n", " <td>7.9250</td>\n", " <td>NaN</td>\n", " <td>S</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>4</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n", " <td>female</td>\n", " <td>35.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>113803</td>\n", " <td>53.1000</td>\n", " <td>C123</td>\n", " <td>S</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>5</td>\n", " <td>0</td>\n", " <td>3</td>\n", " <td>Allen, Mr. William Henry</td>\n", " <td>male</td>\n", " <td>35.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>373450</td>\n", " <td>8.0500</td>\n", " <td>NaN</td>\n", " <td>S</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked Sex_male Embarked_Q \\\n", "0 0 A/5 21171 7.2500 NaN S 1 0 \n", "1 0 PC 17599 71.2833 C85 C 0 0 \n", "2 0 STON/O2. 3101282 7.9250 NaN S 0 0 \n", "3 0 113803 53.1000 C123 S 0 0 \n", "4 0 373450 8.0500 NaN S 1 0 \n", "\n", " Embarked_S \n", "0 1 \n", "1 0 \n", "2 1 \n", "3 1 \n", "4 1 " ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train = pd.concat([train, embarked_dummies], axis=1)\n", "train.head()" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:02.754062Z", "start_time": "2019-12-24T17:43:02.024777Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>PassengerId</th>\n", " <th>Survived</th>\n", " <th>Pclass</th>\n", " <th>Name</th>\n", " <th>Age</th>\n", " <th>SibSp</th>\n", " <th>Parch</th>\n", " <th>Ticket</th>\n", " <th>Fare</th>\n", " <th>Cabin</th>\n", " <th>Sex_female</th>\n", " <th>Sex_male</th>\n", " <th>Embarked_C</th>\n", " <th>Embarked_Q</th>\n", " <th>Embarked_S</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>3</td>\n", " <td>Braund, Mr. Owen Harris</td>\n", " <td>22.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>A/5 21171</td>\n", " <td>7.2500</td>\n", " <td>NaN</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>Cumings, Mrs. John Bradley (Florence Briggs Thayer)</td>\n", " <td>38.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>PC 17599</td>\n", " <td>71.2833</td>\n", " <td>C85</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " <td>1</td>\n", " <td>3</td>\n", " <td>Heikkinen, Miss. Laina</td>\n", " <td>26.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>STON/O2. 3101282</td>\n", " <td>7.9250</td>\n", " <td>NaN</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>4</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n", " <td>35.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>113803</td>\n", " <td>53.1000</td>\n", " <td>C123</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>5</td>\n", " <td>0</td>\n", " <td>3</td>\n", " <td>Allen, Mr. William Henry</td>\n", " <td>35.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>373450</td>\n", " <td>8.0500</td>\n", " <td>NaN</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>...</th>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " </tr>\n", " <tr>\n", " <th>886</th>\n", " <td>887</td>\n", " <td>0</td>\n", " <td>2</td>\n", " <td>Montvila, Rev. Juozas</td>\n", " <td>27.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>211536</td>\n", " <td>13.0000</td>\n", " <td>NaN</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>887</th>\n", " <td>888</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>Graham, Miss. Margaret Edith</td>\n", " <td>19.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>112053</td>\n", " <td>30.0000</td>\n", " <td>B42</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>888</th>\n", " <td>889</td>\n", " <td>0</td>\n", " <td>3</td>\n", " <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n", " <td>NaN</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>W./C. 6607</td>\n", " <td>23.4500</td>\n", " <td>NaN</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>889</th>\n", " <td>890</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>Behr, Mr. Karl Howell</td>\n", " <td>26.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>111369</td>\n", " <td>30.0000</td>\n", " <td>C148</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>890</th>\n", " <td>891</td>\n", " <td>0</td>\n", " <td>3</td>\n", " <td>Dooley, Mr. Patrick</td>\n", " <td>32.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>370376</td>\n", " <td>7.7500</td>\n", " <td>NaN</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>891 rows × 15 columns</p>\n", "</div>" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", ".. ... ... ... \n", "886 887 0 2 \n", "887 888 1 1 \n", "888 889 0 3 \n", "889 890 1 1 \n", "890 891 0 3 \n", "\n", " Name Age SibSp Parch \\\n", "0 Braund, Mr. Owen Harris 22.0 1 0 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 38.0 1 0 \n", "2 Heikkinen, Miss. Laina 26.0 0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1 0 \n", "4 Allen, Mr. William Henry 35.0 0 0 \n", ".. ... ... ... ... \n", "886 Montvila, Rev. Juozas 27.0 0 0 \n", "887 Graham, Miss. Margaret Edith 19.0 0 0 \n", "888 Johnston, Miss. Catherine Helen \"Carrie\" NaN 1 2 \n", "889 Behr, Mr. Karl Howell 26.0 0 0 \n", "890 Dooley, Mr. Patrick 32.0 0 0 \n", "\n", " Ticket Fare Cabin Sex_female Sex_male Embarked_C \\\n", "0 A/5 21171 7.2500 NaN 0 1 0 \n", "1 PC 17599 71.2833 C85 1 0 1 \n", "2 STON/O2. 3101282 7.9250 NaN 1 0 0 \n", "3 113803 53.1000 C123 1 0 0 \n", "4 373450 8.0500 NaN 0 1 0 \n", ".. ... ... ... ... ... ... \n", "886 211536 13.0000 NaN 0 1 0 \n", "887 112053 30.0000 B42 1 0 0 \n", "888 W./C. 6607 23.4500 NaN 1 0 0 \n", "889 111369 30.0000 C148 0 1 1 \n", "890 370376 7.7500 NaN 0 1 0 \n", "\n", " Embarked_Q Embarked_S \n", "0 0 1 \n", "1 0 0 \n", "2 0 1 \n", "3 0 1 \n", "4 0 1 \n", ".. ... ... \n", "886 0 1 \n", "887 0 1 \n", "888 0 1 \n", "889 0 0 \n", "890 1 0 \n", "\n", "[891 rows x 15 columns]" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# pass columns to get_dummies makes it more easy\n", "\n", "train = pd.read_csv('http://bit.ly/kaggletrain')\n", "\n", "pd.get_dummies(train, columns=['Sex', 'Embarked'])" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:02.783417Z", "start_time": "2019-12-24T17:43:02.756504Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>PassengerId</th>\n", " <th>Survived</th>\n", " <th>Pclass</th>\n", " <th>Name</th>\n", " <th>Age</th>\n", " <th>SibSp</th>\n", " <th>Parch</th>\n", " <th>Ticket</th>\n", " <th>Fare</th>\n", " <th>Cabin</th>\n", " <th>Sex_male</th>\n", " <th>Embarked_Q</th>\n", " <th>Embarked_S</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>3</td>\n", " <td>Braund, Mr. Owen Harris</td>\n", " <td>22.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>A/5 21171</td>\n", " <td>7.2500</td>\n", " <td>NaN</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>Cumings, Mrs. John Bradley (Florence Briggs Thayer)</td>\n", " <td>38.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>PC 17599</td>\n", " <td>71.2833</td>\n", " <td>C85</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " <td>1</td>\n", " <td>3</td>\n", " <td>Heikkinen, Miss. Laina</td>\n", " <td>26.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>STON/O2. 3101282</td>\n", " <td>7.9250</td>\n", " <td>NaN</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>4</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n", " <td>35.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>113803</td>\n", " <td>53.1000</td>\n", " <td>C123</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>5</td>\n", " <td>0</td>\n", " <td>3</td>\n", " <td>Allen, Mr. William Henry</td>\n", " <td>35.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>373450</td>\n", " <td>8.0500</td>\n", " <td>NaN</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>...</th>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " </tr>\n", " <tr>\n", " <th>886</th>\n", " <td>887</td>\n", " <td>0</td>\n", " <td>2</td>\n", " <td>Montvila, Rev. Juozas</td>\n", " <td>27.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>211536</td>\n", " <td>13.0000</td>\n", " <td>NaN</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>887</th>\n", " <td>888</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>Graham, Miss. Margaret Edith</td>\n", " <td>19.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>112053</td>\n", " <td>30.0000</td>\n", " <td>B42</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>888</th>\n", " <td>889</td>\n", " <td>0</td>\n", " <td>3</td>\n", " <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n", " <td>NaN</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>W./C. 6607</td>\n", " <td>23.4500</td>\n", " <td>NaN</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>889</th>\n", " <td>890</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>Behr, Mr. Karl Howell</td>\n", " <td>26.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>111369</td>\n", " <td>30.0000</td>\n", " <td>C148</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>890</th>\n", " <td>891</td>\n", " <td>0</td>\n", " <td>3</td>\n", " <td>Dooley, Mr. Patrick</td>\n", " <td>32.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>370376</td>\n", " <td>7.7500</td>\n", " <td>NaN</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>891 rows × 13 columns</p>\n", "</div>" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", ".. ... ... ... \n", "886 887 0 2 \n", "887 888 1 1 \n", "888 889 0 3 \n", "889 890 1 1 \n", "890 891 0 3 \n", "\n", " Name Age SibSp Parch \\\n", "0 Braund, Mr. Owen Harris 22.0 1 0 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 38.0 1 0 \n", "2 Heikkinen, Miss. Laina 26.0 0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1 0 \n", "4 Allen, Mr. William Henry 35.0 0 0 \n", ".. ... ... ... ... \n", "886 Montvila, Rev. Juozas 27.0 0 0 \n", "887 Graham, Miss. Margaret Edith 19.0 0 0 \n", "888 Johnston, Miss. Catherine Helen \"Carrie\" NaN 1 2 \n", "889 Behr, Mr. Karl Howell 26.0 0 0 \n", "890 Dooley, Mr. Patrick 32.0 0 0 \n", "\n", " Ticket Fare Cabin Sex_male Embarked_Q Embarked_S \n", "0 A/5 21171 7.2500 NaN 1 0 1 \n", "1 PC 17599 71.2833 C85 0 0 0 \n", "2 STON/O2. 3101282 7.9250 NaN 0 0 1 \n", "3 113803 53.1000 C123 0 0 1 \n", "4 373450 8.0500 NaN 1 0 1 \n", ".. ... ... ... ... ... ... \n", "886 211536 13.0000 NaN 1 0 1 \n", "887 112053 30.0000 B42 0 0 1 \n", "888 W./C. 6607 23.4500 NaN 0 0 1 \n", "889 111369 30.0000 C148 1 0 0 \n", "890 370376 7.7500 NaN 1 1 0 \n", "\n", "[891 rows x 13 columns]" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# drop the first column after get_dummies\n", "\n", "pd.get_dummies(train, columns=['Sex', 'Embarked'], drop_first=True)" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:02.991607Z", "start_time": "2019-12-24T17:43:02.784456Z" } }, "outputs": [], "source": [ "# example of MovieLens 1M dataset\n", "\n", "mnames = ['movieid', 'title', 'genres']" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:03.179736Z", "start_time": "2019-12-24T17:43:02.993603Z" } }, "outputs": [], "source": [ "movies = pd.read_table(r'MovieLens-1M/movies.dat', sep='::', header=None,\n", " names=mnames, engine='python')" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:03.445851Z", "start_time": "2019-12-24T17:43:03.182662Z" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>movieid</th>\n", " <th>title</th>\n", " <th>genres</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>Toy Story (1995)</td>\n", " <td>Animation|Children's|Comedy</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>Jumanji (1995)</td>\n", " <td>Adventure|Children's|Fantasy</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " <td>Grumpier Old Men (1995)</td>\n", " <td>Comedy|Romance</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>4</td>\n", " <td>Waiting to Exhale (1995)</td>\n", " <td>Comedy|Drama</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>5</td>\n", " <td>Father of the Bride Part II (1995)</td>\n", " <td>Comedy</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " movieid title genres\n", "0 1 Toy Story (1995) Animation|Children's|Comedy\n", "1 2 Jumanji (1995) Adventure|Children's|Fantasy\n", "2 3 Grumpier Old Men (1995) Comedy|Romance\n", "3 4 Waiting to Exhale (1995) Comedy|Drama\n", "4 5 Father of the Bride Part II (1995) Comedy" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movies.head()" ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:03.606208Z", "start_time": "2019-12-24T17:43:03.447874Z" } }, "outputs": [], "source": [ "all_genres = []\n", "\n", "for x in movies.genres:\n", " all_genres.extend(x.split('|'))" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:03.741670Z", "start_time": "2019-12-24T17:43:03.609697Z" } }, "outputs": [ { "data": { "text/plain": [ "array(['Animation', \"Children's\", 'Comedy', 'Adventure', 'Fantasy',\n", " 'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',\n", " 'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',\n", " 'Western'], dtype=object)" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "genres = pd.unique(all_genres)\n", "genres" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:03.877471Z", "start_time": "2019-12-24T17:43:03.745292Z" } }, "outputs": [], "source": [ "# One way to construct the indicator DataFrame is to start with a\n", "# DataFrame of allzeros\n", "\n", "zero_matrix = np.zeros((len(movies), len(genres)))" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:04.014237Z", "start_time": "2019-12-24T17:43:03.881359Z" } }, "outputs": [], "source": [ "dummies = pd.DataFrame(zero_matrix, columns=genres)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, iterate through each movie and set entries in each row of dummies to 1. To do this, we use the dummies.columns to compute the column indices for each genre" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:04.141318Z", "start_time": "2019-12-24T17:43:04.016094Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Animation', \"Children's\", 'Comedy']\n", "\n" ] }, { "data": { "text/plain": [ "array([0, 1, 2], dtype=int64)" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gen = movies.genres[0]\n", "print(gen.split('|'))\n", "dummies.columns.get_indexer(gen.split('|'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, we can use .iloc to set values based on these indices" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:06.917275Z", "start_time": "2019-12-24T17:43:04.144485Z" } }, "outputs": [], "source": [ "for i, gen in enumerate(movies.genres):\n", " indices = dummies.columns.get_indexer(gen.split('|'))\n", " dummies.iloc[i, indices] = 1" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:06.932929Z", "start_time": "2019-12-24T17:43:06.917275Z" } }, "outputs": [], "source": [ "# join with movies\n", "\n", "movies_windic = movies.join(dummies.add_prefix('Genre_'))" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:07.061469Z", "start_time": "2019-12-24T17:43:06.935434Z" } }, "outputs": [ { "data": { "text/plain": [ "movieid 1\n", "title Toy Story (1995)\n", "genres Animation|Children's|Comedy\n", "Genre_Animation 1\n", "Genre_Children's 1\n", "Genre_Comedy 1\n", "Genre_Adventure 0\n", "Genre_Fantasy 0\n", "Genre_Romance 0\n", "Genre_Drama 0\n", "Genre_Action 0\n", "Genre_Crime 0\n", "Genre_Thriller 0\n", "Genre_Horror 0\n", "Genre_Sci-Fi 0\n", "Genre_Documentary 0\n", "Genre_War 0\n", "Genre_Musical 0\n", "Genre_Mystery 0\n", "Genre_Film-Noir 0\n", "Genre_Western 0\n", "Name: 0, dtype: object" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movies_windic.iloc[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## String Manipulation\n", "### Basic String methods" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:07.209653Z", "start_time": "2019-12-24T17:43:07.063739Z" } }, "outputs": [ { "data": { "text/plain": [ "['a', 'b', 'guido']" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val = 'a,b, guido'\n", "\n", "pieces = [x.strip() for x in val.split(',')]\n", "pieces" ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:07.361358Z", "start_time": "2019-12-24T17:43:07.210654Z" } }, "outputs": [ { "data": { "text/plain": [ "'a::b::guido'" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"::\".join(pieces)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**built-in String methods**\n", "\n", "|Argument | Description |\n", "|---|---|\n", "|count|Return the number of non-overlapping occurrences of substring in the string. |\n", "|endswith|Returns True if string ends with suffix. |\n", "|startswith|Returns True if string starts with prefix. |\n", "|join|Use string as delimiter for concatenating a sequence of other strings. |\n", "|index|Return position of first character in substring if found in the string; raises ValueError if not found. |\n", "|find|Return position of first character of rst occurrence of substring in the string; like index, but returns –1 if not found.|\n", "|rfind|Return position of first character of last occurrence of substring in the string; returns –1 if not found. |\n", "|replace|Replace occurrences of string with another string.|\n", "|strip, rstrip, lstrip|Trim whitespace, including newlines; equivalent to x.strip() (and rstrip, lstrip, respectively)\n", " for each element|\n", "|split|Break string into list of substrings using passed delimiter. |\n", "|lower|Convert alphabet characters to lowercase. |\n", "|upper|Convert alphabet characters to uppercase. |\n", "|casefold|Convert characters to lowercase, and convert any region-specific variable character combinations to a common comparable form. \n", "|ljust, rjust|Left justify or right justify, respectively; pad opposite side of string with spaces (or some other fill character) to return a string with a minimum width |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Regular Expressions" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-12-24T13:32:20.008167Z", "start_time": "2019-12-24T13:32:19.995045Z" } }, "source": [ "The re module functions fall into three categories: pattern matching, substitution,\n", "and splitting. Naturally these are all related; a regex describes a pattern to locate in the\n", "text, which can then be used for many purposes." ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:07.495362Z", "start_time": "2019-12-24T17:43:07.367465Z" } }, "outputs": [], "source": [ "import re" ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:07.637104Z", "start_time": "2019-12-24T17:43:07.495362Z" } }, "outputs": [ { "data": { "text/plain": [ "'foo bar\\tbaz \\tqux'" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = \"foo bar\\tbaz \\tqux\"\n", "text" ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:07.780343Z", "start_time": "2019-12-24T17:43:07.638102Z" } }, "outputs": [ { "data": { "text/plain": [ "['foo', 'bar', 'baz', 'qux']" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.split('\\s+', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When you call re.split('\\s+', text), the regular expression is first compiled, and then its split method is called on the passed text. You can compile the regex yourself with **re.compile**, forming a reusable regex object" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating a regex object with re.compile is highly recommended if you intend to\n", "apply the same expression to many strings; doing so will save CPU cycles" ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:07.916446Z", "start_time": "2019-12-24T17:43:07.785420Z" } }, "outputs": [], "source": [ "regex = re.compile('\\s+')" ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:08.056820Z", "start_time": "2019-12-24T17:43:07.920407Z" } }, "outputs": [ { "data": { "text/plain": [ "['foo', 'bar', 'baz', 'qux']" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regex.split(text)" ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:08.205267Z", "start_time": "2019-12-24T17:43:08.058813Z" } }, "outputs": [], "source": [ "text = \"\"\"Dave dave@google.com\n", "Steve steve@gmail.com\n", "Rob rob@gmail.com\n", "Ryan ryan@yahoo.com\n", "\"\"\"\n", "pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}'\n", "\n", "# re.IGNORECASE makes the regex case-insensitive\n", "regex = re.compile(pattern, flags=re.IGNORECASE)" ] }, { "cell_type": "code", "execution_count": 88, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:08.365332Z", "start_time": "2019-12-24T17:43:08.212747Z" } }, "outputs": [ { "data": { "text/plain": [ "['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regex.findall(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**match** and **search** are closely related to findall. While findall returns all matches\n", "in a string, search returns only the first match. More rigidly, match only matches at\n", "the beginning of the string" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**search** returns a special match object for the first email address in the text. For the\n", "preceding regex, the match object can only tell us the start and end position of the\n", "pattern in the string" ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:08.529015Z", "start_time": "2019-12-24T17:43:08.366332Z" } }, "outputs": [], "source": [ "m = regex.search(text)" ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:08.991732Z", "start_time": "2019-12-24T17:43:08.529015Z" } }, "outputs": [ { "data": { "text/plain": [ "<re.Match object; span=(5, 20), match='dave@google.com'>" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m" ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:09.132275Z", "start_time": "2019-12-24T17:43:08.994110Z" } }, "outputs": [ { "data": { "text/plain": [ "'dave@google.com'" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text[m.start(): m.end()]" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:09.248262Z", "start_time": "2019-12-24T17:43:09.133848Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "None\n", "\n" ] } ], "source": [ "print(regex.match(text))" ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:09.404466Z", "start_time": "2019-12-24T17:43:09.253311Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dave REDACTED\n", "Steve REDACTED\n", "Rob REDACTED\n", "Ryan REDACTED\n", "\n", "\n" ] } ], "source": [ "print(regex.sub('REDACTED', text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " segment each address into its three components: username, domain name, and domain suffix" ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:09.547000Z", "start_time": "2019-12-24T17:43:09.407943Z" } }, "outputs": [], "source": [ "pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'" ] }, { "cell_type": "code", "execution_count": 95, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:09.688577Z", "start_time": "2019-12-24T17:43:09.548001Z" } }, "outputs": [], "source": [ "regex = re.compile(pattern, flags=re.IGNORECASE)" ] }, { "cell_type": "code", "execution_count": 96, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:09.805024Z", "start_time": "2019-12-24T17:43:09.692210Z" } }, "outputs": [ { "data": { "text/plain": [ "('wesm', 'bright', 'net')" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m = regex.match('wesm@bright.net')\n", "m.groups()" ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:09.942973Z", "start_time": "2019-12-24T17:43:09.811263Z" } }, "outputs": [ { "data": { "text/plain": [ "[('dave', 'google', 'com'),\n", " ('steve', 'gmail', 'com'),\n", " ('rob', 'gmail', 'com'),\n", " ('ryan', 'yahoo', 'com')]" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regex.findall(text)" ] }, { "cell_type": "code", "execution_count": 98, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:10.101665Z", "start_time": "2019-12-24T17:43:09.947203Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dave Username: dave, Domain: google, Suffix: com\n", "Steve Username: steve, Domain: gmail, Suffix: com\n", "Rob Username: rob, Domain: gmail, Suffix: com\n", "Ryan Username: ryan, Domain: yahoo, Suffix: com\n", "\n", "\n" ] } ], "source": [ "print(regex.sub(r'Username: \\1, Domain: \\2, Suffix: \\3', text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Regular expression methods**\n", "\n", "|Argument| Description|\n", "|-|-|\n", "|findall| Return all non-overlapping matching patterns in a string as a list|\n", "|finditer| Like findall, but returns an iterator|\n", "|match |Match pattern at start of string and optionally segment pattern components into groups; if the pattern matches, returns a match object, and otherwise None|\n", "|search| Scan string for match to pattern; returning a match object if so; unlike match, the match can be anywhere in the string as opposed to only at the beginning|\n", "|split| Break string into pieces at each occurrence of pattern|\n", "|sub, subn|Replace all (sub) or first n occurrences (subn) of pattern in string with replacement expression; use symbols \\1, \\2, ... to refer to match group elements in the replacement string|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Vectorized String Functions in pandas" ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:10.245311Z", "start_time": "2019-12-24T17:43:10.105666Z" } }, "outputs": [ { "data": { "text/plain": [ "Dave dave@google.com\n", "Steve steve@gmail.com\n", "Rob rob@gmail.com\n", "Wes NaN\n", "dtype: object" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',\n", " 'Rob': 'rob@gmail.com', 'Wes': np.nan}\n", "\n", "data = pd.Series(data)\n", "data" ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:10.380107Z", "start_time": "2019-12-24T17:43:10.246309Z" } }, "outputs": [ { "data": { "text/plain": [ "Dave False\n", "Steve False\n", "Rob False\n", "Wes True\n", "dtype: bool" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.isnull()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can apply string and regular expression methods can be applied (passing a lambda or other function) to each value using data.map, but it will fail on the NA (null) values. To cope with this, Series has array-oriented methods for string operations that skip NA values. These are accessed through Series’s str attribute; for example," ] }, { "cell_type": "code", "execution_count": 101, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:10.532755Z", "start_time": "2019-12-24T17:43:10.381108Z" } }, "outputs": [ { "data": { "text/plain": [ "Dave False\n", "Steve True\n", "Rob True\n", "Wes NaN\n", "dtype: object" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.str.contains('gmail')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Regular expressions can be used, too, along with any re options like IGNORECASE" ] }, { "cell_type": "code", "execution_count": 102, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:10.680071Z", "start_time": "2019-12-24T17:43:10.536520Z" } }, "outputs": [ { "data": { "text/plain": [ "'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\\\.([A-Z]{2,4})'" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pattern" ] }, { "cell_type": "code", "execution_count": 103, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:10.819319Z", "start_time": "2019-12-24T17:43:10.684214Z" } }, "outputs": [ { "data": { "text/plain": [ "Dave [(dave, google, com)]\n", "Steve [(steve, gmail, com)]\n", "Rob [(rob, gmail, com)]\n", "Wes NaN\n", "dtype: object" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.str.findall(pattern, flags=re.IGNORECASE)" ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:10.964110Z", "start_time": "2019-12-24T17:43:10.824320Z" } }, "outputs": [], "source": [ " matches = data.str.match(pattern, flags=re.IGNORECASE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a couple of ways to do vectorized element retrieval. Either use **str.get** or index into the str attribute" ] }, { "cell_type": "code", "execution_count": 105, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:11.118070Z", "start_time": "2019-12-24T17:43:10.965112Z" } }, "outputs": [ { "data": { "text/plain": [ "Dave True\n", "Steve True\n", "Rob True\n", "Wes NaN\n", "dtype: object" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matches" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }