{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dealing with Messy Data\n", "> This chapter introduces you to the reality of messy and incomplete data. You will learn how to find where your data has missing values and explore multiple approaches on how to deal with them. You will also use string manipulation techniques to deal with unwanted characters in your dataset. This is the Summary of lecture \"Feature Engineering for Machine Learning in Python\", via datacamp.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine_Learning]\n", "- image: " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why do missing values exist?\n", "- How gaps in data occur\n", " - Data not being collected properly\n", " - Collection and management errors\n", " - Data intentionally being omitted\n", " - Could be created due to transformations of the data\n", "- Why we care?\n", " - Some models cannot work with missing data (Nulls/NaN)\n", " - Missing data may be a sign a wider data issue\n", " - Missing data can be a useful feature" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How sparse is my data?\n", "Most data sets contain missing values, often represented as NaN (Not a Number). If you are working with Pandas you can easily check how many missing values exist in each column.\n", "\n", "Let's find out how many of the developers taking the survey chose to enter their age (found in the `Age` column of `so_survey_df`) and their gender (`Gender` column of `so_survey_df`)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurveyDateFormalEducationConvertedSalaryHobbyCountryStackOverflowJobsRecommendVersionControlAgeYears ExperienceGenderRawSalary
02/28/18 20:20Bachelor's degree (BA. BS. B.Eng.. etc.)NaNYesSouth AfricaNaNGit2113MaleNaN
16/28/18 13:26Bachelor's degree (BA. BS. B.Eng.. etc.)70841.0YesSweeden7.0Git;Subversion389Male70,841.00
26/6/18 3:37Bachelor's degree (BA. BS. B.Eng.. etc.)NaNNoSweeden8.0Git4511NaNNaN
35/9/18 1:06Some college/university study without earning ...21426.0YesSweedenNaNZip file back-ups4612Male21,426.00
44/12/18 22:41Bachelor's degree (BA. BS. B.Eng.. etc.)41671.0YesUK8.0Git397Male£41,671.00
\n", "
" ], "text/plain": [ " SurveyDate FormalEducation \\\n", "0 2/28/18 20:20 Bachelor's degree (BA. BS. B.Eng.. etc.) \n", "1 6/28/18 13:26 Bachelor's degree (BA. BS. B.Eng.. etc.) \n", "2 6/6/18 3:37 Bachelor's degree (BA. BS. B.Eng.. etc.) \n", "3 5/9/18 1:06 Some college/university study without earning ... \n", "4 4/12/18 22:41 Bachelor's degree (BA. BS. B.Eng.. etc.) \n", "\n", " ConvertedSalary Hobby Country StackOverflowJobsRecommend \\\n", "0 NaN Yes South Africa NaN \n", "1 70841.0 Yes Sweeden 7.0 \n", "2 NaN No Sweeden 8.0 \n", "3 21426.0 Yes Sweeden NaN \n", "4 41671.0 Yes UK 8.0 \n", "\n", " VersionControl Age Years Experience Gender RawSalary \n", "0 Git 21 13 Male NaN \n", "1 Git;Subversion 38 9 Male 70,841.00 \n", "2 Git 45 11 NaN NaN \n", "3 Zip file back-ups 46 12 Male 21,426.00 \n", "4 Git 39 7 Male £41,671.00 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "so_survey_df = pd.read_csv('./dataset/Combined_DS_v10.csv')\n", "so_survey_df.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Age 999\n", "Gender 693\n", "dtype: int64\n" ] } ], "source": [ "# Subset the DataFrame\n", "sub_df = so_survey_df[['Age', 'Gender']]\n", "\n", "# Print the number of non-missing values\n", "print(sub_df.notnull().sum())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Finding the missing values\n", "While having a summary of how much of your data is missing can be useful, often you will need to find the exact locations of these missing values. Using the same subset of the StackOverflow data from the last exercise (`sub_df`), you will show how a value can be flagged as missing." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AgeGender
021Male
138Male
245NaN
346Male
439Male
539Male
634Male
724Female
823Male
936NaN
\n", "
" ], "text/plain": [ " Age Gender\n", "0 21 Male\n", "1 38 Male\n", "2 45 NaN\n", "3 46 Male\n", "4 39 Male\n", "5 39 Male\n", "6 34 Male\n", "7 24 Female\n", "8 23 Male\n", "9 36 NaN" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print the top 10 entries of the DataFrame\n", "sub_df.head(10)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AgeGender
0FalseFalse
1FalseFalse
2FalseTrue
3FalseFalse
4FalseFalse
5FalseFalse
6FalseFalse
7FalseFalse
8FalseFalse
9FalseTrue
\n", "
" ], "text/plain": [ " Age Gender\n", "0 False False\n", "1 False False\n", "2 False True\n", "3 False False\n", "4 False False\n", "5 False False\n", "6 False False\n", "7 False False\n", "8 False False\n", "9 False True" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print the locations of the missing values\n", "sub_df.head(10).isnull()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AgeGender
0TrueTrue
1TrueTrue
2TrueFalse
3TrueTrue
4TrueTrue
5TrueTrue
6TrueTrue
7TrueTrue
8TrueTrue
9TrueFalse
\n", "
" ], "text/plain": [ " Age Gender\n", "0 True True\n", "1 True True\n", "2 True False\n", "3 True True\n", "4 True True\n", "5 True True\n", "6 True True\n", "7 True True\n", "8 True True\n", "9 True False" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print the locations of the missing values\n", "sub_df.head(10).notnull()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dealing with missing values (I)\n", "- Issues with deletion\n", " - It deletes vaild data points\n", " - Relies on randomness\n", " - Reduces information" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Listwise deletion\n", "The simplest way to deal with missing values in your dataset when they are occurring entirely at random is to remove those rows, also called **'listwise deletion'**.\n", "\n", "Depending on the use case, you will sometimes want to remove all missing values in your data while other times you may want to only remove a particular column if too many values are missing in that column." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(999, 11)\n" ] } ], "source": [ "# Print the number of rows and columns\n", "print(so_survey_df.shape)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(264, 11)\n" ] } ], "source": [ "# Create a new DataFrame dropping all incomplete rows\n", "no_missing_values_rows = so_survey_df.dropna()\n", "\n", "# Print the shape of the new DataFrame\n", "print(no_missing_values_rows.shape)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(999, 7)\n" ] } ], "source": [ "# Create a new DataFrame dropping all columns with incomplete rows\n", "no_missing_values_cols = so_survey_df.dropna(axis=1)\n", "\n", "# Print the shape fo the new DataFrame\n", "print(no_missing_values_cols.shape)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(693, 11)\n" ] } ], "source": [ "# Drop all rows where Gender is missing\n", "no_gender = so_survey_df.dropna(subset=['Gender'])\n", "\n", "# Print the shape of the new DataFrame\n", "print(no_gender.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Replacing missing values with constants\n", "While removing missing data entirely maybe a correct approach in many situations, this may result in a lot of information being omitted from your models.\n", "\n", "You may find categorical columns where the missing value is a valid piece of information in itself, such as someone refusing to answer a question in a survey. In these cases, you can fill all missing values with a new category entirely, for example 'No response given'." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Male 632\n", "Female 53\n", "Transgender 2\n", "Female;Male 2\n", "Non-binary. genderqueer. or gender non-conforming 1\n", "Male;Non-binary. genderqueer. or gender non-conforming 1\n", "Female;Male;Transgender;Non-binary. genderqueer. or gender non-conforming 1\n", "Female;Transgender 1\n", "Name: Gender, dtype: int64\n" ] } ], "source": [ "# Print the count of occurrence\n", "print(so_survey_df['Gender'].value_counts())" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Male 632\n", "Not Given 306\n", "Female 53\n", "Transgender 2\n", "Female;Male 2\n", "Non-binary. genderqueer. or gender non-conforming 1\n", "Male;Non-binary. genderqueer. or gender non-conforming 1\n", "Female;Male;Transgender;Non-binary. genderqueer. or gender non-conforming 1\n", "Female;Transgender 1\n", "Name: Gender, dtype: int64\n" ] } ], "source": [ "# Replace missing values\n", "so_survey_df['Gender'].fillna('Not Given', inplace=True)\n", "\n", "# Print the count of each value\n", "print(so_survey_df['Gender'].value_counts())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dealing with missing values (II)\n", "- Deleting missing values\n", " - Can't delete rows with missing values in the test set\n", "- What else can you do?\n", " - Categorical columns: Replace missing values with the most common occurring value or with a string that flags missing values such as 'None'\n", " - Numerical columns: Replace missing values with a suitable value" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filling continuous missing values\n", "In the last lesson, you dealt with different methods of removing data missing values and filling in missing values with a fixed string. 
These approaches are valid in many cases, particularly when dealing with categorical columns, but they have limited use when working with continuous values. In these cases, it may be best to fill the missing values in the column with a value calculated from the entries present in the column.\n", "\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0    NaN\n", "1    7.0\n", "2    8.0\n", "3    NaN\n", "4    8.0\n", "Name: StackOverflowJobsRecommend, dtype: float64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print the first five rows of the StackOverflowJobsRecommend column\n", "so_survey_df['StackOverflowJobsRecommend'].head()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0    7.0\n", "1    7.0\n", "2    8.0\n", "3    7.0\n", "4    8.0\n", "Name: StackOverflowJobsRecommend, dtype: float64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fill missing values with the mean\n", "so_survey_df['StackOverflowJobsRecommend'].fillna(so_survey_df['StackOverflowJobsRecommend'].mean(), \n", "                                                   inplace=True)\n", "\n", "# Round the StackOverflowJobsRecommend values\n", "so_survey_df['StackOverflowJobsRecommend'] = round(so_survey_df['StackOverflowJobsRecommend'])\n", "\n", "# Print the first five rows of the StackOverflowJobsRecommend column\n", "so_survey_df['StackOverflowJobsRecommend'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dealing with other data issues\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dealing with stray characters (I)\n", "In this exercise, you will work with the `RawSalary` column of `so_survey_df`, which contains the wages of the respondents along with the currency symbols and commas, such as `$42,000`. 
When importing data from Microsoft Excel, more often than not you will come across data in this form.\n", "\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Remove the commas in the column\n", "so_survey_df['RawSalary'] = so_survey_df['RawSalary'].str.replace(',', '')\n", "\n", "# Remove the dollar signs in the column\n", "so_survey_df['RawSalary'] = so_survey_df['RawSalary'].str.replace('$', '')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dealing with stray characters (II)\n", "In the last exercise, you could tell quickly, based on the `df.head()` call, which characters were causing an issue. In many cases this will not be so apparent. There will often be values deep within a column that are preventing you from casting the column as a numeric type so that it can be used in a model or further feature engineering.\n", "\n", "One approach to finding these values is to force the column to the desired data type using `pd.to_numeric()`, coercing any values causing issues to NaN, then filtering the DataFrame by just the rows containing the NaN values.\n", "\n", "If you try to cast the `RawSalary` column as a float, it will fail, as an additional character can now be found in it. Find the character and remove it so the column can be cast as a float." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0     NaN\n", "2     NaN\n", "6     NaN\n", "8     NaN\n", "11    NaN\n", "      ... 
\n", "989 NaN\n", "990 NaN\n", "992 NaN\n", "994 NaN\n", "997 NaN\n", "Name: RawSalary, Length: 334, dtype: object\n" ] } ], "source": [ "# Attempt to convert the column to numeric values\n", "numeric_vals = pd.to_numeric(so_survey_df['RawSalary'], errors='coerce')\n", "\n", "# find the indexes of missing values\n", "idx = so_survey_df['RawSalary'].isna()\n", "\n", "# Print the relevant raws\n", "print(so_survey_df['RawSalary'][idx])" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 NaN\n", "1 70841.0\n", "2 NaN\n", "3 21426.0\n", "4 41671.0\n", " ... \n", "994 NaN\n", "995 58746.0\n", "996 55000.0\n", "997 NaN\n", "998 1000000.0\n", "Name: RawSalary, Length: 999, dtype: float64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Replace the offending characters\n", "so_survey_df['RawSalary'] = so_survey_df['RawSalary'].str.replace('£', '')\n", "\n", "# Convert the column to float\n", "so_survey_df['RawSalary'] = so_survey_df['RawSalary'].astype(float)\n", "\n", "# Print the column\n", "so_survey_df['RawSalary']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Method chaining\n", "When applying multiple operations on the same column (like in the previous exercises), you made the changes in several steps, assigning the results back in each step. However, when applying multiple successive operations on the same column, you can \"chain\" these operations together for clarity and ease of management. This can be achieved by calling multiple methods sequentially:\n", "\n", "```python\n", "# Method chaining\n", "df['column'] = df['column'].method1().method2().method3()\n", "\n", "# Same as \n", "df['column'] = df['column'].method1()\n", "df['column'] = df['column'].method2()\n", "df['column'] = df['column'].method3()\n", "```\n", "In this exercise you will repeat the steps you performed in the last two exercises, but do so using method chaining." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 NaN\n", "1 70841.0\n", "2 NaN\n", "3 21426.0\n", "4 41671.0\n", " ... \n", "994 NaN\n", "995 58746.0\n", "996 55000.0\n", "997 NaN\n", "998 1000000.0\n", "Name: RawSalary, Length: 999, dtype: float64\n" ] } ], "source": [ "so_survey_df = pd.read_csv('./dataset/Combined_DS_v10.csv')\n", "# Use method chaining\n", "so_survey_df['RawSalary'] = so_survey_df['RawSalary']\\\n", " .str.replace(',', '')\\\n", " .str.replace('$', '')\\\n", " .str.replace('£', '')\\\n", " .astype(float)\n", "\n", "# Print the RawSalary column\n", "print(so_survey_df['RawSalary'])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }