{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "A6iFUUQLNDlE"
   },
   "source": [
    "<table class=\"m01-notebook-buttons\" align=\"left\">\n",
    "  <td>\n",
    "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/yy/dviz-course/blob/master/docs/m06-data/lab06.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
    "  </td>\n",
    "  <td>\n",
    "    <a href=\"https://yyahn.com/dviz-course/m06-data/lab06/\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View on Github</a>\n",
    "  </td>\n",
    "  <td>\n",
    "    <a href=\"https://raw.githubusercontent.com/yy/dviz-course/master/docs/m06-data/lab06.ipynb\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View raw on Github</a>\n",
    "  </td>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Cps1y3AA5TqD"
   },
   "source": [
    "# Module 6: Data types and tidy data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "WU236-s35TqM"
   },
   "source": [
    "## Tidy data\n",
    "\n",
    "Let's take a look at this small dataset: https://raw.githubusercontent.com/tidyverse/tidyr/main/data-raw/relig_income.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "executionInfo": {
     "elapsed": 3,
     "status": "ok",
     "timestamp": 1686938700592,
     "user": {
      "displayName": "Vincent Wong",
      "userId": "06927694896148305320"
     },
     "user_tz": 240
    },
    "id": "MG17akhx5TqN"
   },
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 613
    },
    "executionInfo": {
     "elapsed": 168,
     "status": "ok",
     "timestamp": 1686938700927,
     "user": {
      "displayName": "Vincent Wong",
      "userId": "06927694896148305320"
     },
     "user_tz": 240
    },
    "id": "Ah7fiB905TqQ",
    "outputId": "2435ddff-18ae-4d03-af72-ece1ab02df84"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>religion</th>\n",
       "      <th>&lt;$10k</th>\n",
       "      <th>$10-20k</th>\n",
       "      <th>$20-30k</th>\n",
       "      <th>$30-40k</th>\n",
       "      <th>$40-50k</th>\n",
       "      <th>$50-75k</th>\n",
       "      <th>$75-100k</th>\n",
       "      <th>$100-150k</th>\n",
       "      <th>&gt;150k</th>\n",
       "      <th>Don't know/refused</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Agnostic</td>\n",
       "      <td>27</td>\n",
       "      <td>34</td>\n",
       "      <td>60</td>\n",
       "      <td>81</td>\n",
       "      <td>76</td>\n",
       "      <td>137</td>\n",
       "      <td>122</td>\n",
       "      <td>109</td>\n",
       "      <td>84</td>\n",
       "      <td>96</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Atheist</td>\n",
       "      <td>12</td>\n",
       "      <td>27</td>\n",
       "      <td>37</td>\n",
       "      <td>52</td>\n",
       "      <td>35</td>\n",
       "      <td>70</td>\n",
       "      <td>73</td>\n",
       "      <td>59</td>\n",
       "      <td>74</td>\n",
       "      <td>76</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Buddhist</td>\n",
       "      <td>27</td>\n",
       "      <td>21</td>\n",
       "      <td>30</td>\n",
       "      <td>34</td>\n",
       "      <td>33</td>\n",
       "      <td>58</td>\n",
       "      <td>62</td>\n",
       "      <td>39</td>\n",
       "      <td>53</td>\n",
       "      <td>54</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Catholic</td>\n",
       "      <td>418</td>\n",
       "      <td>617</td>\n",
       "      <td>732</td>\n",
       "      <td>670</td>\n",
       "      <td>638</td>\n",
       "      <td>1116</td>\n",
       "      <td>949</td>\n",
       "      <td>792</td>\n",
       "      <td>633</td>\n",
       "      <td>1489</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Don’t know/refused</td>\n",
       "      <td>15</td>\n",
       "      <td>14</td>\n",
       "      <td>15</td>\n",
       "      <td>11</td>\n",
       "      <td>10</td>\n",
       "      <td>35</td>\n",
       "      <td>21</td>\n",
       "      <td>17</td>\n",
       "      <td>18</td>\n",
       "      <td>116</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Evangelical Prot</td>\n",
       "      <td>575</td>\n",
       "      <td>869</td>\n",
       "      <td>1064</td>\n",
       "      <td>982</td>\n",
       "      <td>881</td>\n",
       "      <td>1486</td>\n",
       "      <td>949</td>\n",
       "      <td>723</td>\n",
       "      <td>414</td>\n",
       "      <td>1529</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Hindu</td>\n",
       "      <td>1</td>\n",
       "      <td>9</td>\n",
       "      <td>7</td>\n",
       "      <td>9</td>\n",
       "      <td>11</td>\n",
       "      <td>34</td>\n",
       "      <td>47</td>\n",
       "      <td>48</td>\n",
       "      <td>54</td>\n",
       "      <td>37</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Historically Black Prot</td>\n",
       "      <td>228</td>\n",
       "      <td>244</td>\n",
       "      <td>236</td>\n",
       "      <td>238</td>\n",
       "      <td>197</td>\n",
       "      <td>223</td>\n",
       "      <td>131</td>\n",
       "      <td>81</td>\n",
       "      <td>78</td>\n",
       "      <td>339</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Jehovah's Witness</td>\n",
       "      <td>20</td>\n",
       "      <td>27</td>\n",
       "      <td>24</td>\n",
       "      <td>24</td>\n",
       "      <td>21</td>\n",
       "      <td>30</td>\n",
       "      <td>15</td>\n",
       "      <td>11</td>\n",
       "      <td>6</td>\n",
       "      <td>37</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Jewish</td>\n",
       "      <td>19</td>\n",
       "      <td>19</td>\n",
       "      <td>25</td>\n",
       "      <td>25</td>\n",
       "      <td>30</td>\n",
       "      <td>95</td>\n",
       "      <td>69</td>\n",
       "      <td>87</td>\n",
       "      <td>151</td>\n",
       "      <td>162</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Mainline Prot</td>\n",
       "      <td>289</td>\n",
       "      <td>495</td>\n",
       "      <td>619</td>\n",
       "      <td>655</td>\n",
       "      <td>651</td>\n",
       "      <td>1107</td>\n",
       "      <td>939</td>\n",
       "      <td>753</td>\n",
       "      <td>634</td>\n",
       "      <td>1328</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>Mormon</td>\n",
       "      <td>29</td>\n",
       "      <td>40</td>\n",
       "      <td>48</td>\n",
       "      <td>51</td>\n",
       "      <td>56</td>\n",
       "      <td>112</td>\n",
       "      <td>85</td>\n",
       "      <td>49</td>\n",
       "      <td>42</td>\n",
       "      <td>69</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>Muslim</td>\n",
       "      <td>6</td>\n",
       "      <td>7</td>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>9</td>\n",
       "      <td>23</td>\n",
       "      <td>16</td>\n",
       "      <td>8</td>\n",
       "      <td>6</td>\n",
       "      <td>22</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>Orthodox</td>\n",
       "      <td>13</td>\n",
       "      <td>17</td>\n",
       "      <td>23</td>\n",
       "      <td>32</td>\n",
       "      <td>32</td>\n",
       "      <td>47</td>\n",
       "      <td>38</td>\n",
       "      <td>42</td>\n",
       "      <td>46</td>\n",
       "      <td>73</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>Other Christian</td>\n",
       "      <td>9</td>\n",
       "      <td>7</td>\n",
       "      <td>11</td>\n",
       "      <td>13</td>\n",
       "      <td>13</td>\n",
       "      <td>14</td>\n",
       "      <td>18</td>\n",
       "      <td>14</td>\n",
       "      <td>12</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>Other Faiths</td>\n",
       "      <td>20</td>\n",
       "      <td>33</td>\n",
       "      <td>40</td>\n",
       "      <td>46</td>\n",
       "      <td>49</td>\n",
       "      <td>63</td>\n",
       "      <td>46</td>\n",
       "      <td>40</td>\n",
       "      <td>41</td>\n",
       "      <td>71</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>Other World Religions</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>Unaffiliated</td>\n",
       "      <td>217</td>\n",
       "      <td>299</td>\n",
       "      <td>374</td>\n",
       "      <td>365</td>\n",
       "      <td>341</td>\n",
       "      <td>528</td>\n",
       "      <td>407</td>\n",
       "      <td>321</td>\n",
       "      <td>258</td>\n",
       "      <td>597</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   religion  <$10k  $10-20k  $20-30k  $30-40k  $40-50k   \n",
       "0                  Agnostic     27       34       60       81       76  \\\n",
       "1                   Atheist     12       27       37       52       35   \n",
       "2                  Buddhist     27       21       30       34       33   \n",
       "3                  Catholic    418      617      732      670      638   \n",
       "4        Don’t know/refused     15       14       15       11       10   \n",
       "5          Evangelical Prot    575      869     1064      982      881   \n",
       "6                     Hindu      1        9        7        9       11   \n",
       "7   Historically Black Prot    228      244      236      238      197   \n",
       "8         Jehovah's Witness     20       27       24       24       21   \n",
       "9                    Jewish     19       19       25       25       30   \n",
       "10            Mainline Prot    289      495      619      655      651   \n",
       "11                   Mormon     29       40       48       51       56   \n",
       "12                   Muslim      6        7        9       10        9   \n",
       "13                 Orthodox     13       17       23       32       32   \n",
       "14          Other Christian      9        7       11       13       13   \n",
       "15             Other Faiths     20       33       40       46       49   \n",
       "16    Other World Religions      5        2        3        4        2   \n",
       "17             Unaffiliated    217      299      374      365      341   \n",
       "\n",
       "    $50-75k  $75-100k  $100-150k  >150k  Don't know/refused  \n",
       "0       137       122        109     84                  96  \n",
       "1        70        73         59     74                  76  \n",
       "2        58        62         39     53                  54  \n",
       "3      1116       949        792    633                1489  \n",
       "4        35        21         17     18                 116  \n",
       "5      1486       949        723    414                1529  \n",
       "6        34        47         48     54                  37  \n",
       "7       223       131         81     78                 339  \n",
       "8        30        15         11      6                  37  \n",
       "9        95        69         87    151                 162  \n",
       "10     1107       939        753    634                1328  \n",
       "11      112        85         49     42                  69  \n",
       "12       23        16          8      6                  22  \n",
       "13       47        38         42     46                  73  \n",
       "14       14        18         14     12                  18  \n",
       "15       63        46         40     41                  71  \n",
       "16        7         3          4      4                   8  \n",
       "17      528       407        321    258                 597  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pew_df = pd.read_csv('https://raw.githubusercontent.com/tidyverse/tidyr/main/data-raw/relig_income.csv')\n",
    "pew_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "syUy_FXc5TqT"
   },
   "source": [
    "This dataset is about the relationships between income and religion, assembled from a research by the Pew Research Center. Is this dataset tidy or not? Why?\n",
    "\n",
    "Yes, many of the columns are values, not variable names. How should we fix it?\n",
    "\n",
    "Pandas provides a convenient function called [melt](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html). You specify the `id_vars` that are variable columns, and `value_vars` that are value columns, and provide the name for the variable as well as the name for the values.\n",
    "\n",
    "**Q: so please go ahead and tidy it up! I'd suggest to use the variable name \"income\" and value name \"frequency\"**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "executionInfo": {
     "elapsed": 131,
     "status": "ok",
     "timestamp": 1686938983642,
     "user": {
      "displayName": "Vincent Wong",
      "userId": "06927694896148305320"
     },
     "user_tz": 240
    },
    "id": "1MYnYfhe5TqU"
   },
   "outputs": [],
   "source": [
    "# YOUR SOLUTION HERE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "k7Xz2GAn5TqV"
   },
   "source": [
    "If you were successful, you'll have something like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>religion</th>\n",
       "      <th>income</th>\n",
       "      <th>frequency</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Agnostic</td>\n",
       "      <td>&lt;$10k</td>\n",
       "      <td>27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Atheist</td>\n",
       "      <td>&lt;$10k</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Buddhist</td>\n",
       "      <td>&lt;$10k</td>\n",
       "      <td>27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Catholic</td>\n",
       "      <td>&lt;$10k</td>\n",
       "      <td>418</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Don’t know/refused</td>\n",
       "      <td>&lt;$10k</td>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>175</th>\n",
       "      <td>Orthodox</td>\n",
       "      <td>Don't know/refused</td>\n",
       "      <td>73</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>176</th>\n",
       "      <td>Other Christian</td>\n",
       "      <td>Don't know/refused</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>177</th>\n",
       "      <td>Other Faiths</td>\n",
       "      <td>Don't know/refused</td>\n",
       "      <td>71</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>178</th>\n",
       "      <td>Other World Religions</td>\n",
       "      <td>Don't know/refused</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>179</th>\n",
       "      <td>Unaffiliated</td>\n",
       "      <td>Don't know/refused</td>\n",
       "      <td>597</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>180 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                  religion              income  frequency\n",
       "0                 Agnostic               <$10k         27\n",
       "1                  Atheist               <$10k         12\n",
       "2                 Buddhist               <$10k         27\n",
       "3                 Catholic               <$10k        418\n",
       "4       Don’t know/refused               <$10k         15\n",
       "..                     ...                 ...        ...\n",
       "175               Orthodox  Don't know/refused         73\n",
       "176        Other Christian  Don't know/refused         18\n",
       "177           Other Faiths  Don't know/refused         71\n",
       "178  Other World Religions  Don't know/refused          8\n",
       "179           Unaffiliated  Don't know/refused        597\n",
       "\n",
       "[180 rows x 3 columns]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pew_tidy_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This \"tidy\" form is the default form that you want to keep your datasets. It is easy to filter, manipulate, and plot. Many of the visualization libraries we'll use in this course expect the data to be in this form."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dDyOarQy5TqY"
   },
   "source": [
    "## Data types\n",
    "\n",
    "Let's talk about data types briefly. Understanding data types is not only important for choosing the right visualizations, but also important for efficient computing and storage of data. You may not have thought about how pandas represent data in memory. A Pandas `Dataframe` is essentially a bunch of `Series`, and those `Series` are essentially `numpy` arrays. An array may contain a fixed-length items such as integers or variable length items such as strings. Putting some efforts to think about the correct data type can potentially save a lot of memory as well as time.\n",
    "\n",
    "A nice example would be the categorical data type. If you have a variable that only has several possible values, it's essentially a categorical data. Take a look at the `income` variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 25,
     "status": "ok",
     "timestamp": 1686938043328,
     "user": {
      "displayName": "Vincent Wong",
      "userId": "06927694896148305320"
     },
     "user_tz": 240
    },
    "id": "0KoiuKUo5TqZ",
    "outputId": "f2742f7d-2fb9-47f2-8190-a167b33c230b"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "income\n",
       "<$10k                 18\n",
       "$10-20k               18\n",
       "$20-30k               18\n",
       "$30-40k               18\n",
       "$40-50k               18\n",
       "$50-75k               18\n",
       "$75-100k              18\n",
       "$100-150k             18\n",
       ">150k                 18\n",
       "Don't know/refused    18\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pew_tidy_df.income.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "wJW_t5_K5Tqb"
   },
   "source": [
    "These were the column names in the original non-tidy data. The value can take only one of these income ranges and thus it is a categorical data. What is the data type that pandas use to store this column?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 23,
     "status": "ok",
     "timestamp": 1686938043329,
     "user": {
      "displayName": "Vincent Wong",
      "userId": "06927694896148305320"
     },
     "user_tz": 240
    },
    "id": "6oIXzznW5Tqb",
    "outputId": "b077eecd-5dd0-4af5-9c85-896196c9e0ea"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dtype('O')"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pew_tidy_df.income.dtype"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "gq7wZ8RM5Tqc"
   },
   "source": [
    "The `O` means that it is an **O**bject data type, which does not have a fixed size like integer or float. The series contains a sort of pointer to the actual text objects. You can actually inspect the amount of memory used by the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 226,
     "status": "ok",
     "timestamp": 1686938043536,
     "user": {
      "displayName": "Vincent Wong",
      "userId": "06927694896148305320"
     },
     "user_tz": 240
    },
    "id": "bZKUkW2o5Tqd",
    "outputId": "0528d96b-2750-4980-a014-16f615c5484e"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index         132\n",
       "religion     1440\n",
       "income       1440\n",
       "frequency    1440\n",
       "dtype: int64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pew_tidy_df.memory_usage()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 38,
     "status": "ok",
     "timestamp": 1686938043538,
     "user": {
      "displayName": "Vincent Wong",
      "userId": "06927694896148305320"
     },
     "user_tz": 240
    },
    "id": "NOGRStlY5Tqe",
    "outputId": "06cb3b8f-30d6-416e-9fd2-2281dc53987b"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index          132\n",
       "religion     12780\n",
       "income       11700\n",
       "frequency     1440\n",
       "dtype: int64"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pew_tidy_df.memory_usage(deep=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "LuX_IM-W5Tqe"
   },
   "source": [
    "What's going on with the `deep=True` option? When you don't specify `deep=True`, the memory usage method just tells you the amount of memory used by the numpy arrays in the pandas dataframe. When you pass `deep=True`, it tells you the total amount of memory by including the memory used by all the text objects. So, the `religion` and `income` columns occupies almost ten times of memory than the `frequency` column, which is simply an array of integers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 33,
     "status": "ok",
     "timestamp": 1686938043539,
     "user": {
      "displayName": "Vincent Wong",
      "userId": "06927694896148305320"
     },
     "user_tz": 240
    },
    "id": "5_THR_Ux5Tqf",
    "outputId": "e9a12ef7-df32-4cfc-9ce0-8ac9e2316bd1"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dtype('int64')"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pew_tidy_df.frequency.dtype"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "31_v9CQE5Tqf"
   },
   "source": [
    "Is there any way to save up the memory? Note that there are only 10 categories in the income variable. That means we just need 10 numbers to represent the categories! Of course we need to store the names of each category, but that's just one-time cost. The simplest way to convert a column is using `astype` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "id": "cH2j6lrU5Tqg"
   },
   "outputs": [],
   "source": [
    "income_categorical_series = pew_tidy_df.income.astype('category')\n",
    "# you can do pew_tidy_df.income = pew_tidy_df.income.astype('category')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "LO3liL7A5Tqh"
   },
   "source": [
    "Now, this series has the `CategoricalDtype` dtype."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 26,
     "status": "ok",
     "timestamp": 1686938043542,
     "user": {
      "displayName": "Vincent Wong",
      "userId": "06927694896148305320"
     },
     "user_tz": 240
    },
    "id": "VdDV-3FQ5Tqh",
    "outputId": "5e804917-ba2c-4eff-9416-738385f3affd"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "CategoricalDtype(categories=['$10-20k', '$100-150k', '$20-30k', '$30-40k', '$40-50k',\n",
       "                  '$50-75k', '$75-100k', '<$10k', '>150k',\n",
       "                  'Don't know/refused'],\n",
       ", ordered=False)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "income_categorical_series.dtype"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1Ns4mncK5Tqi"
   },
   "source": [
    "How much memory do we use?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 23,
     "status": "ok",
     "timestamp": 1686938043543,
     "user": {
      "displayName": "Vincent Wong",
      "userId": "06927694896148305320"
     },
     "user_tz": 240
    },
    "id": "ZqDBtWw75Tqi",
    "outputId": "94155744-dab0-4cf3-9d53-979f949abc88"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1262"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "income_categorical_series.memory_usage(deep=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compared with the original column?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 20,
     "status": "ok",
     "timestamp": 1686938043544,
     "user": {
      "displayName": "Vincent Wong",
      "userId": "06927694896148305320"
     },
     "user_tz": 240
    },
    "id": "eEyB-K3Y5Tqj",
    "outputId": "36f232c2-2bc2-48ea-fecf-ddf458e7e612"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "11832"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pew_tidy_df.income.memory_usage(deep=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "mRkVknuH5Tqk"
   },
   "source": [
    "We have reduced the memory usage by almost 10 fold! Not only that, because now the values are just numbers, it will be much faster to match, filter, manipulate. If your dataset is huge, this can save up a lot of space and time.\n",
    "\n",
    "If the categories have ordering, you can specify the ordering too."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 17,
     "status": "ok",
     "timestamp": 1686938043545,
     "user": {
      "displayName": "Vincent Wong",
      "userId": "06927694896148305320"
     },
     "user_tz": 240
    },
    "id": "dbXVdCUw5Tqk",
    "outputId": "a6ae3122-6ecb-4ba1-fdc7-0e973cf1ec7a"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "CategoricalDtype(categories=['Don't know/refused', '<$10k', '$10-20k', '$20-30k',\n",
       "                  '$30-40k', '$40-50k', '$50-75k', '$75-100k', '$100-150k',\n",
       "                  '>150k'],\n",
       ", ordered=True)"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from pandas.api.types import CategoricalDtype\n",
    "income_type = CategoricalDtype(categories=[\"Don't know/refused\", '<$10k', '$10-20k', '$20-30k', '$30-40k',\n",
    "                                           '$40-50k', '$50-75k', '$75-100k', '$100-150k', '>150k'], ordered=True)\n",
    "income_type"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 28,
     "status": "ok",
     "timestamp": 1686938043721,
     "user": {
      "displayName": "Vincent Wong",
      "userId": "06927694896148305320"
     },
     "user_tz": 240
    },
    "id": "g5NwkZjB5Tqk",
    "outputId": "53f7fa6f-87f0-4fca-8546-d5ea01c4c328"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "CategoricalDtype(categories=['Don't know/refused', '<$10k', '$10-20k', '$20-30k',\n",
       "                  '$30-40k', '$40-50k', '$50-75k', '$75-100k', '$100-150k',\n",
       "                  '>150k'],\n",
       ", ordered=True)"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pew_tidy_df.income.astype(income_type).dtype"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "IFQtbxyF5Tql"
   },
   "source": [
    "This data type now allows you to compare and sort based on the ordering.\n",
    "\n",
    "**Q: ok, now convert both religion and income columns of `pew_tidy_df` as categorical dtype (in place) and show that `pew_tidy_df` now uses much less memory**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 170,
     "status": "ok",
     "timestamp": 1686938128628,
     "user": {
      "displayName": "Vincent Wong",
      "userId": "06927694896148305320"
     },
     "user_tz": 240
    },
    "id": "kaRvFJwV5Tql",
    "outputId": "09f3cd6d-b95f-46a2-ff87-e778269ba296"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index         132\n",
       "religion     2035\n",
       "income       1130\n",
       "frequency    1440\n",
       "dtype: int64"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# YOUR SOLUTION HERE\n",
    "\n",
    "pew_tidy_df.memory_usage(deep=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "opJC8rlG5Tqm"
   },
   "source": [
    "## If you want to know more\n",
    "\n",
    "- [Jean-Nicholas Hould: Tidy Data in Python](http://www.jeannicholashould.com/tidy-data-in-python.html)\n",
    "- [Stephen Simmons | Pandas from the Inside](https://www.youtube.com/watch?v=CowlcrtSyME)\n",
    "- [Data school: How do I make my pandas DataFrame smaller and faster?](https://www.youtube.com/watch?v=wDYDYGyN_cw)"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "colab": {
   "provenance": []
  },
  "kernel_info": {
   "name": "dviz"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  },
  "nteract": {
   "version": "0.15.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}