{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Chapter 3\n", "---\n", "# Data Wrangling\n", "\n", "### 3.0 Introduction\n", "Data wrangling is a broad term used, often informally, to describe the process of transforming raw data to a clean and organized format ready for use.\n", "\n", "The most common data structure used to \"wrangle\" data is the data frame, which can be both intuitive and incredibly versatile. Data frames are tabular, meaning that htey are based on rows and columns like you'd find in a spreadsheet\n", "\n", "### 3.1 Creating a Data Frame" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "dataframe = pd.DataFrame()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Describing the Data" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Name PClass Age Sex Survived SexCode\n", "0 Allen, Miss Elisabeth Walton 1st 29.0 female 1 1\n", "1 Allison, Miss Helen Loraine 1st 2.0 female 0 1\n", "Dimensions: (1313, 6)\n" ] }, { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Age</th>\n", " <th>Survived</th>\n", " <th>SexCode</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>count</th>\n", " <td>756.000000</td>\n", " <td>1313.000000</td>\n", " <td>1313.000000</td>\n", " </tr>\n", " <tr>\n", " <th>mean</th>\n", " <td>30.397989</td>\n", " <td>0.342727</td>\n", " <td>0.351866</td>\n", " </tr>\n", " <tr>\n", " <th>std</th>\n", " <td>14.259049</td>\n", " <td>0.474802</td>\n", " <td>0.477734</td>\n", " </tr>\n", " <tr>\n", " <th>min</th>\n", " <td>0.170000</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>25%</th>\n", " <td>21.000000</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>50%</th>\n", " <td>28.000000</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>75%</th>\n", " <td>39.000000</td>\n", " <td>1.000000</td>\n", " <td>1.000000</td>\n", " </tr>\n", " <tr>\n", " <th>max</th>\n", " <td>71.000000</td>\n", " <td>1.000000</td>\n", " <td>1.000000</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Age Survived SexCode\n", "count 756.000000 1313.000000 1313.000000\n", "mean 30.397989 0.342727 0.351866\n", "std 14.259049 0.474802 0.477734\n", "min 0.170000 0.000000 0.000000\n", "25% 21.000000 0.000000 0.000000\n", "50% 28.000000 0.000000 0.000000\n", "75% 39.000000 1.000000 1.000000\n", "max 71.000000 1.000000 1.000000" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url = \"http://tinyurl.com/titanic-csv\"\n", "df = pd.read_csv(url)\n", "# show first two rows\n", "print(df.head(2)) # also try tail(2) for last two rows\n", "\n", "# show dimensions\n", "print(\"Dimensions: {}\".format(df.shape))\n", "\n", "# show statistics\n", "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.3 Navigating DataFrames" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Name Allen, Miss Elisabeth Walton\n", "PClass 1st\n", "Age 29\n", "Sex female\n", "Survived 1\n", "SexCode 1\n", "Name: 0, dtype: object\n", " Name PClass Age Sex \\\n", "1 Allison, Miss Helen Loraine 1st 2.0 female \n", "2 Allison, Mr Hudson Joshua Creighton 1st 30.0 male \n", "3 Allison, Mrs Hudson JC (Bessie Waldo Daniels) 1st 25.0 female \n", "\n", " Survived SexCode \n", "1 0 1 \n", "2 0 0 \n", "3 0 1 \n", " Name PClass Age Sex \\\n", "0 Allen, Miss Elisabeth Walton 1st 29.0 female \n", "1 Allison, Miss Helen Loraine 1st 2.0 female \n", "2 Allison, Mr Hudson Joshua Creighton 1st 30.0 male \n", "3 Allison, Mrs Hudson JC (Bessie Waldo Daniels) 1st 25.0 female \n", "\n", " Survived SexCode \n", "0 1 1 \n", "1 0 1 \n", "2 0 0 \n", "3 0 1 \n" ] } ], "source": [ "# select the first row\n", "print(df.iloc[0])\n", "\n", "# select three rows\n", "print(df.iloc[1:4])\n", "\n", "# all rows up to and including the fourth row\n", "print(df.iloc[:4])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "DataFrames do not need to be numerically indexed. We can set the index of a DataFrame to any value where the value is unique to each row. For example, we can set the index to be passenger names and then select rows using a name:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Name Allen, Miss Elisabeth Walton\n", "PClass 1st\n", "Age 29\n", "Sex female\n", "Survived 1\n", "SexCode 1\n", "Name: Allen, Miss Elisabeth Walton, dtype: object" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# set index\n", "df = df.set_index(df['Name'])\n", "\n", "# show row\n", "df.loc['Allen, Miss Elisabeth Walton']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Discussion\n", "To select individual rows and slices of rows, pandas provides two methods:\n", "* `loc` is useful when the index of the DataFrame is a label (a string)\n", "* `iloc` works by looking for the position in the DataFrame. For exmaple, iloc[0] will return the first row regardless of whether the index is an integer or a label\n", "\n", "## 3.4 Selecting Rows Based on Conditionals" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Name</th>\n", " <th>PClass</th>\n", " <th>Age</th>\n", " <th>Sex</th>\n", " <th>Survived</th>\n", " <th>SexCode</th>\n", " </tr>\n", " <tr>\n", " <th>Name</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Allen, Miss Elisabeth Walton</th>\n", " <td>Allen, Miss Elisabeth Walton</td>\n", " <td>1st</td>\n", " <td>29.0</td>\n", " <td>female</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>Allison, Miss Helen Loraine</th>\n", " <td>Allison, Miss Helen Loraine</td>\n", " <td>1st</td>\n", " <td>2.0</td>\n", " <td>female</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Name PClass Age \\\n", "Name \n", "Allen, Miss Elisabeth Walton Allen, Miss Elisabeth Walton 1st 29.0 \n", "Allison, Miss Helen Loraine Allison, Miss Helen Loraine 1st 2.0 \n", "\n", " Sex Survived SexCode \n", "Name \n", "Allen, Miss Elisabeth Walton female 1 1 \n", "Allison, Miss Helen Loraine female 0 1 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# select top two rows where column 'sex' is 'female'\n", "df[df['Sex'] == 'female'].head(2)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Name</th>\n", " <th>PClass</th>\n", " <th>Age</th>\n", " <th>Sex</th>\n", " <th>Survived</th>\n", " <th>SexCode</th>\n", " </tr>\n", " <tr>\n", " <th>Name</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Crosby, Mrs Edward Gifford (Catherine Elizabeth Halstead)</th>\n", " <td>Crosby, Mrs Edward Gifford (Catherine Elizabet...</td>\n", " <td>1st</td>\n", " <td>69.0</td>\n", " <td>female</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Name \\\n", "Name \n", "Crosby, Mrs Edward Gifford (Catherine Elizabeth... Crosby, Mrs Edward Gifford (Catherine Elizabet... \n", "\n", " PClass Age Sex \\\n", "Name \n", "Crosby, Mrs Edward Gifford (Catherine Elizabeth... 1st 69.0 female \n", "\n", " Survived SexCode \n", "Name \n", "Crosby, Mrs Edward Gifford (Catherine Elizabeth... 1 1 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# multiple conditions\n", "df[(df['Sex'] == 'female') & (df['Age'] >= 65)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.5 Replacing Values" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Name\n", "Allen, Miss Elisabeth Walton Woman\n", "Allison, Miss Helen Loraine Woman\n", "Name: Sex, dtype: object" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# replace any instance of 'female' with Woman\n", "df['Sex'].replace('female', 'Woman').head(2)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Name\n", "Allen, Miss Elisabeth Walton Woman\n", "Allison, Miss Helen Loraine Woman\n", "Allison, Mr Hudson Joshua Creighton Man\n", "Allison, Mrs Hudson JC (Bessie Waldo Daniels) Woman\n", "Allison, Master Hudson Trevor Man\n", "Name: Sex, dtype: object" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# replace any instance of 'female' with Woman\n", "df['Sex'].replace(['female', 'male'], ['Woman', 'Man']).head(5)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Name</th>\n", " <th>PClass</th>\n", " <th>Age</th>\n", " <th>Sex</th>\n", " <th>Survived</th>\n", " <th>SexCode</th>\n", " </tr>\n", " <tr>\n", " <th>Name</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Allen, Miss Elisabeth Walton</th>\n", " <td>Allen, Miss Elisabeth Walton</td>\n", " <td>1st</td>\n", " <td>29</td>\n", " <td>female</td>\n", " <td>One</td>\n", " <td>One</td>\n", " </tr>\n", " <tr>\n", " <th>Allison, Miss Helen Loraine</th>\n", " <td>Allison, Miss Helen Loraine</td>\n", " <td>1st</td>\n", " <td>2</td>\n", " <td>female</td>\n", " <td>0</td>\n", " <td>One</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Name PClass Age Sex \\\n", "Name \n", "Allen, Miss Elisabeth Walton Allen, Miss Elisabeth Walton 1st 29 female \n", "Allison, Miss Helen Loraine Allison, Miss Helen Loraine 1st 2 female \n", "\n", " Survived SexCode \n", "Name \n", "Allen, Miss Elisabeth Walton One One \n", "Allison, Miss Helen Loraine 0 One " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.replace(1, \"One\").head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.6 Renaming Columns" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Name</th>\n", " <th>Passenger Class</th>\n", " <th>Age</th>\n", " <th>Sex</th>\n", " <th>Survived</th>\n", " <th>SexCode</th>\n", " </tr>\n", " <tr>\n", " <th>Name</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Allen, Miss Elisabeth Walton</th>\n", " <td>Allen, Miss Elisabeth Walton</td>\n", " <td>1st</td>\n", " <td>29.0</td>\n", " <td>female</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>Allison, Miss Helen Loraine</th>\n", " <td>Allison, Miss Helen Loraine</td>\n", " <td>1st</td>\n", " <td>2.0</td>\n", " <td>female</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Name Passenger Class \\\n", "Name \n", "Allen, Miss Elisabeth Walton Allen, Miss Elisabeth Walton 1st \n", "Allison, Miss Helen Loraine Allison, Miss Helen Loraine 1st \n", "\n", " Age Sex Survived SexCode \n", "Name \n", "Allen, Miss Elisabeth Walton 29.0 female 1 1 \n", "Allison, Miss Helen Loraine 2.0 female 0 1 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.rename(columns={'PClass': 'Passenger Class'}).head(2)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Name</th>\n", " <th>Passenger Class</th>\n", " <th>Age</th>\n", " <th>Gender</th>\n", " <th>Survived</th>\n", " <th>SexCode</th>\n", " </tr>\n", " <tr>\n", " <th>Name</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Allen, Miss Elisabeth Walton</th>\n", " <td>Allen, Miss Elisabeth Walton</td>\n", " <td>1st</td>\n", " <td>29.0</td>\n", " <td>female</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>Allison, Miss Helen Loraine</th>\n", " <td>Allison, Miss Helen Loraine</td>\n", " <td>1st</td>\n", " <td>2.0</td>\n", " <td>female</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Name Passenger Class \\\n", "Name \n", "Allen, Miss Elisabeth Walton Allen, Miss Elisabeth Walton 1st \n", "Allison, Miss Helen Loraine Allison, Miss Helen Loraine 1st \n", "\n", " Age Gender Survived SexCode \n", "Name \n", "Allen, Miss Elisabeth Walton 29.0 female 1 1 \n", "Allison, Miss Helen Loraine 2.0 female 0 1 " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.rename(columns={'PClass': 'Passenger Class', 'Sex': 'Gender'}).head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.7 Finding the Min, Max, Sum, Average, and Count" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Maximum: 71.0\n", "Minimum: 0.17\n", "Mean: 30.397989417989418\n", "Sum: 22980.88\n", "Count: 756\n" ] } ], "source": [ "print('Maximum: {}'.format(df['Age'].max()))\n", "print('Minimum: {}'.format(df['Age'].min()))\n", "print('Mean: {}'.format(df['Age'].mean()))\n", "print('Sum: {}'.format(df['Age'].sum()))\n", "print('Count: {}'.format(df['Age'].count()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to these, pandas also offers variance (`var`), standard deviation (`std`), kurtosis (`kurt`), skewness (`skew`), and a number of others.\n", "\n", "We can also apply these methods to whole dataframes" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Variance: Age 203.320470\n", "Survived 0.225437\n", "SexCode 0.228230\n", "dtype: float64\n", "Standard Deviation: Age 14.259049\n", "Survived 0.474802\n", "SexCode 0.477734\n", "dtype: float64\n", "Kurtosis: Age -0.036536\n", "Survived -1.562162\n", "SexCode -1.616702\n", "dtype: float64\n", "Skewness: Age 0.368511\n", "Survived 0.663491\n", "SexCode 0.621098\n", "dtype: float64\n" ] } ], "source": [ "print(\"Variance: {}\".format(df.var()))\n", "print(\"Standard Deviation: {}\".format(df.std()))\n", "print(\"Kurtosis: {}\".format(df.kurt()))\n", "print(\"Skewness: {}\".format(df.skew()))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.8 Finding Unique Values" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['female', 'male'], dtype=object)" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# unique will return an array of all unique values in a column\n", "df['Sex'].unique()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "male 851\n", "female 462\n", "Name: Sex, dtype: int64" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# value_counts will display all unique values with the number of times each value appears\n", "df['Sex'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.9 Handling Missing Values" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Name</th>\n", " <th>PClass</th>\n", " <th>Age</th>\n", " <th>Sex</th>\n", " <th>Survived</th>\n", " <th>SexCode</th>\n", " </tr>\n", " <tr>\n", " <th>Name</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Aubert, Mrs Leontine Pauline</th>\n", " <td>Aubert, Mrs Leontine Pauline</td>\n", " <td>1st</td>\n", " <td>NaN</td>\n", " <td>female</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>Barkworth, Mr Algernon H</th>\n", " <td>Barkworth, Mr Algernon H</td>\n", " <td>1st</td>\n", " <td>NaN</td>\n", " <td>male</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Name PClass Age \\\n", "Name \n", "Aubert, Mrs Leontine Pauline Aubert, Mrs Leontine Pauline 1st NaN \n", "Barkworth, Mr Algernon H Barkworth, Mr Algernon H 1st NaN \n", "\n", " Sex Survived SexCode \n", "Name \n", "Aubert, Mrs Leontine Pauline female 1 1 \n", "Barkworth, Mr Algernon H male 1 0 " ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# select missing values, show 2 rows\n", "df[df['Age'].isnull()].head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.10 Deleting a Column" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Name</th>\n", " <th>PClass</th>\n", " <th>Sex</th>\n", " <th>Survived</th>\n", " <th>SexCode</th>\n", " </tr>\n", " <tr>\n", " <th>Name</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Allen, Miss Elisabeth Walton</th>\n", " <td>Allen, Miss Elisabeth Walton</td>\n", " <td>1st</td>\n", " <td>female</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>Allison, Miss Helen Loraine</th>\n", " <td>Allison, Miss Helen Loraine</td>\n", " <td>1st</td>\n", " <td>female</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Name PClass Sex \\\n", "Name \n", "Allen, Miss Elisabeth Walton Allen, Miss Elisabeth Walton 1st female \n", "Allison, Miss Helen Loraine Allison, Miss Helen Loraine 1st female \n", "\n", " Survived SexCode \n", "Name \n", "Allen, Miss Elisabeth Walton 1 1 \n", "Allison, Miss Helen Loraine 0 1 " ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# axis=1 means the column axis\n", "df.drop('Age', axis=1).head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.11 Deleting a Row" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Name</th>\n", " <th>PClass</th>\n", " <th>Age</th>\n", " <th>Sex</th>\n", " <th>Survived</th>\n", " <th>SexCode</th>\n", " </tr>\n", " <tr>\n", " <th>Name</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Allen, Miss Elisabeth Walton</th>\n", " <td>Allen, Miss Elisabeth Walton</td>\n", " <td>1st</td>\n", " <td>29.0</td>\n", " <td>female</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>Allison, Miss Helen Loraine</th>\n", " <td>Allison, Miss Helen Loraine</td>\n", " <td>1st</td>\n", " <td>2.0</td>\n", " <td>female</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Name PClass Age \\\n", "Name \n", "Allen, Miss Elisabeth Walton Allen, Miss Elisabeth Walton 1st 29.0 \n", "Allison, Miss Helen Loraine Allison, Miss Helen Loraine 1st 2.0 \n", "\n", " Sex Survived SexCode \n", "Name \n", "Allen, Miss Elisabeth Walton female 1 1 \n", "Allison, Miss Helen Loraine female 0 1 " ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create new dataframe excluding the rows you want to delete\n", "df[df['Sex'] != 'male'].head(2)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Name</th>\n", " <th>PClass</th>\n", " <th>Age</th>\n", " <th>Sex</th>\n", " <th>Survived</th>\n", " <th>SexCode</th>\n", " </tr>\n", " <tr>\n", " <th>Name</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Allen, Miss Elisabeth Walton</th>\n", " <td>Allen, Miss Elisabeth Walton</td>\n", " <td>1st</td>\n", " <td>29.0</td>\n", " <td>female</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>Allison, Mr Hudson Joshua Creighton</th>\n", " <td>Allison, Mr Hudson Joshua Creighton</td>\n", " <td>1st</td>\n", " <td>30.0</td>\n", " <td>male</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Name \\\n", "Name \n", "Allen, Miss Elisabeth Walton Allen, Miss Elisabeth Walton \n", "Allison, Mr Hudson Joshua Creighton Allison, Mr Hudson Joshua Creighton \n", "\n", " PClass Age Sex Survived SexCode \n", "Name \n", "Allen, Miss Elisabeth Walton 1st 29.0 female 1 1 \n", "Allison, Mr Hudson Joshua Creighton 1st 30.0 male 0 0 " ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# delete a row by matching a unique value\n", "df[df['Name'] != 'Allison, Miss Helen Loraine'].head(2)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Name</th>\n", " <th>PClass</th>\n", " <th>Age</th>\n", " <th>Sex</th>\n", " <th>Survived</th>\n", " <th>SexCode</th>\n", " </tr>\n", " <tr>\n", " <th>Name</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Allen, Miss Elisabeth Walton</th>\n", " <td>Allen, Miss Elisabeth Walton</td>\n", " <td>1st</td>\n", " <td>29.0</td>\n", " <td>female</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>Allison, Miss Helen Loraine</th>\n", " <td>Allison, Miss Helen Loraine</td>\n", " <td>1st</td>\n", " <td>2.0</td>\n", " <td>female</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Name PClass Age \\\n", "Name \n", "Allen, Miss Elisabeth Walton Allen, Miss Elisabeth Walton 1st 29.0 \n", "Allison, Miss Helen Loraine Allison, Miss Helen Loraine 1st 2.0 \n", "\n", " Sex Survived SexCode \n", "Name \n", "Allen, Miss Elisabeth Walton female 1 1 \n", "Allison, Miss Helen Loraine female 0 1 " ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# delete a row by index\n", "df[df.index != 0].head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.13 Grouping Rows by Values" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Age</th>\n", " <th>Survived</th>\n", " <th>SexCode</th>\n", " </tr>\n", " <tr>\n", " <th>Sex</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>female</th>\n", " <td>29.396424</td>\n", " <td>0.666667</td>\n", " <td>1.0</td>\n", " </tr>\n", " <tr>\n", " <th>male</th>\n", " <td>31.014338</td>\n", " <td>0.166863</td>\n", " <td>0.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Age Survived SexCode\n", "Sex \n", "female 29.396424 0.666667 1.0\n", "male 31.014338 0.166863 0.0" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('Sex').mean()" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Survived\n", "0 863\n", "1 450\n", "Name: Name, dtype: int64" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('Survived')['Name'].count()" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Sex Survived\n", "female 0 24.901408\n", " 1 30.867143\n", "male 0 32.320780\n", " 1 25.951875\n", "Name: Age, dtype: float64" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(['Sex', 'Survived'])['Age'].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.14 Grouping Rows by Time" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.15 Looping Over a Column" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ALLEN, MISS ELISABETH WALTON\n", "ALLISON, MISS HELEN LORAINE\n" ] } ], "source": [ "# for .. in .. loop\n", "for name in df['Name'][0:2]:\n", " print(name.upper())" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['ALLEN, MISS ELISABETH WALTON', 'ALLISON, MISS HELEN LORAINE']" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# list comprehension (more \"pythonic\")\n", "[name.upper() for name in df['Name'][0:2]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.16 Applying a Function Over All Elements in a Column" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Name\n", "Allen, Miss Elisabeth Walton ALLEN, MISS ELISABETH WALTON\n", "Allison, Miss Helen Loraine ALLISON, MISS HELEN LORAINE\n", "Name: Name, dtype: object" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def uppercase(x):\n", " return x.upper()\n", "\n", "df['Name'].apply(uppercase)[0:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Discussion\n", "`apply` is a great way to do data cleaning and wrangling. It is common to write a function to perform some useful operation (separate first and last names, convert string to floats, etc) and then map that funtion to every element in a column.\n", "\n", "## 3.17 Applying a Function to Groups" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Name</th>\n", " <th>PClass</th>\n", " <th>Age</th>\n", " <th>Sex</th>\n", " <th>Survived</th>\n", " <th>SexCode</th>\n", " </tr>\n", " <tr>\n", " <th>Sex</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>female</th>\n", " <td>462</td>\n", " <td>462</td>\n", " <td>288</td>\n", " <td>462</td>\n", " <td>462</td>\n", " <td>462</td>\n", " </tr>\n", " <tr>\n", " <th>male</th>\n", " <td>851</td>\n", " <td>851</td>\n", " <td>468</td>\n", " <td>851</td>\n", " <td>851</td>\n", " <td>851</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Name PClass Age Sex Survived SexCode\n", "Sex \n", "female 462 462 288 462 462 462\n", "male 851 851 468 851 851 851" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('Sex').apply(lambda x: x.count())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By combining `groupby` and `apply` we can calculate custom statistics or apply any function to each group separately\n", "\n", "## 3.18 Concatenating DataFrames" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### See Also\n", "* A Visual Explanation of SQL Joins (https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/)\n", "* pandas' documentation on merging (https://pandas.pydata.org/pandas-docs/stable/merging.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:machine_learning_cookbook]", "language": "python", "name": "conda-env-machine_learning_cookbook-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }