{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# [教學目標]\n",
    "- 以下程式碼將示範在 python 如何利用 pandas.cut 與 .qcut 計算出數據的離散化標籤"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# [範例重點]\n",
    "- pandas.cut 的等寬劃分效果 (In[3], Out[4])\n",
    "- pandas.qcut 的等頻劃分效果 (In[5], Out[6])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 載入套件\n",
    "import os\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 初始設定 Ages 的資料\n",
    "ages = pd.DataFrame({\"age\": [18,22,25,27,7,21,23,37,30,61,45,41,9,18,80,100]})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 等寬劃分"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 新增欄位 \"equal_width_age\", 對年齡做等寬劃分\n",
    "ages[\"equal_width_age\"] = pd.cut(ages[\"age\"], 4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(6.907, 30.25]    10\n",
       "(30.25, 53.5]      3\n",
       "(76.75, 100.0]     2\n",
       "(53.5, 76.75]      1\n",
       "Name: equal_width_age, dtype: int64"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 觀察等寬劃分下, 每個種組距各出現幾次\n",
    "ages[\"equal_width_age\"].value_counts() # 每個 bin 的值的範圍大小都是一樣的"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 等頻劃分"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 新增欄位 \"equal_freq_age\", 對年齡做等頻劃分\n",
    "ages[\"equal_freq_age\"] = pd.qcut(ages[\"age\"], 4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(42.0, 100.0]     4\n",
       "(26.0, 42.0]      4\n",
       "(20.25, 26.0]     4\n",
       "(6.999, 20.25]    4\n",
       "Name: equal_freq_age, dtype: int64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 觀察等頻劃分下, 每個種組距各出現幾次\n",
    "ages[\"equal_freq_age\"].value_counts() # 每個 bin 的資料筆數是一樣的"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 作業\n",
    "新增一個欄位 `customized_age_grp`，把 `age` 分為 (0, 10], (10, 20], (20, 30], (30, 50], (50, 100] 這五組，'(' 表示不包含, ']' 表示包含\n",
    "\n",
    "Hints: 執行 ??pd.cut()，了解提供其中 bins 這個參數的使用方式"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# 作業\n",
    "- 新增一個欄位 `customized_age_grp`，把 `age` 分為 (0, 10], (10, 20], (20, 30], (30, 50], (50, 100] 這五組，\n",
    "'(' 表示不包含, ']' 表示包含  \n",
    "- Hints: 執行 ??pd.cut()，了解提供其中 bins 這個參數的使用方式"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# [作業目標]\n",
    "- 請同學試著查詢 pandas.cut 這個函數還有哪些參數, 藉由改動參數以達成目標\n",
    "- 藉由查詢與改動參數的過程, 熟悉查詢函數的方法與理解參數性質, 並了解數值的離散化的調整工具"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# [作業重點]\n",
    "- 仿照 In[3], In[4] 的語法, 並設定 pd.cut 的參數以指定間距"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 載入套件\n",
    "import os\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 初始設定 Ages 的資料\n",
    "ages = pd.DataFrame({\"age\": [18,22,25,27,7,21,23,37,30,61,45,41,9,18,80,100]})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 等寬劃分"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 新增欄位 \"equal_width_age\", 對年齡做等寬劃分\n",
    "ages[\"equal_width_age\"] = pd.cut(ages[\"age\"], 4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "新增一個欄位 customized_age_grp，把 age 分為 (0, 10], (10, 20], (20, 30], (30, 50], (50, 100] 這五組， '(' 表示不包含, ']' 表示包含\n",
    "\n",
    "</br>[Pandas.cut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html?highlight=cut)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(20, 30]     6\n",
       "(50, 100]    3\n",
       "(30, 50]     3\n",
       "(10, 20]     2\n",
       "(0, 10]      2\n",
       "Name: customized_age_grp, dtype: int64"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ages[\"customized_age_grp\"] = pd.cut(ages[\"age\"],bins=[0,10,20,30,50,100])\n",
    "ages[\"customized_age_grp\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "藉由查詢與改動參數的過程, 熟悉查詢函數的方法與理解參數性質, 並了解數值的離散化的調整工具"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     1\n",
       "1     2\n",
       "2     2\n",
       "3     2\n",
       "4     0\n",
       "5     2\n",
       "6     2\n",
       "7     3\n",
       "8     2\n",
       "9     4\n",
       "10    3\n",
       "11    3\n",
       "12    0\n",
       "13    1\n",
       "14    4\n",
       "15    4\n",
       "Name: customized_age_grp, dtype: int64"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ages[\"customized_age_grp\"] = pd.cut(ages[\"age\"],bins=[0,10,20,30,50,100],labels=False)\n",
    "ages[\"customized_age_grp\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2    6\n",
       "4    3\n",
       "3    3\n",
       "1    2\n",
       "0    2\n",
       "Name: customized_age_grp, dtype: int64"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ages[\"customized_age_grp\"].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0      (10, 20]\n",
       "1      (20, 30]\n",
       "2      (20, 30]\n",
       "3      (20, 30]\n",
       "4       (0, 10]\n",
       "5      (20, 30]\n",
       "6      (20, 30]\n",
       "7      (30, 50]\n",
       "8      (20, 30]\n",
       "9     (50, 100]\n",
       "10     (30, 50]\n",
       "11     (30, 50]\n",
       "12      (0, 10]\n",
       "13     (10, 20]\n",
       "14    (50, 100]\n",
       "15    (50, 100]\n",
       "Name: customized_age_grp, dtype: category\n",
       "Categories (5, interval[int64]): [(0, 10] < (10, 20] < (20, 30] < (30, 50] < (50, 100]]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ages[\"customized_age_grp\"] = pd.cut(ages[\"age\"],bins=[0,10,20,30,50,100],include_lowest=False)\n",
    "ages[\"customized_age_grp\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(20.0, 30.0]      6\n",
       "(50.0, 100.0]     3\n",
       "(30.0, 50.0]      3\n",
       "(10.0, 20.0]      2\n",
       "(-0.001, 10.0]    2\n",
       "Name: customized_age_grp, dtype: int64"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ages[\"customized_age_grp\"] = pd.cut(ages[\"age\"],bins=[0,10,20,30,50,100],include_lowest=True)\n",
    "ages[\"customized_age_grp\"].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(6.907, 30.25]    10\n",
       "(30.25, 53.5]      3\n",
       "(76.75, 100.0]     2\n",
       "(53.5, 76.75]      1\n",
       "Name: equal_width_age, dtype: int64"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 觀察等寬劃分下, 每個種組距各出現幾次\n",
    "ages[\"equal_width_age\"].value_counts() # 每個 bin 的值的範圍大小都是一樣的"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 等頻劃分"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 新增欄位 \"equal_freq_age\", 對年齡做等頻劃分\n",
    "ages[\"equal_freq_age\"] = pd.qcut(ages[\"age\"], 4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(42.0, 100.0]     4\n",
       "(26.0, 42.0]      4\n",
       "(20.25, 26.0]     4\n",
       "(6.999, 20.25]    4\n",
       "Name: equal_freq_age, dtype: int64"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 觀察等頻劃分下, 每個種組距各出現幾次\n",
    "ages[\"equal_freq_age\"].value_counts() # 每個 bin 的資料筆數是一樣的"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>equal_width_age</th>\n",
       "      <th>customized_age_grp</th>\n",
       "      <th>equal_freq_age</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>18</td>\n",
       "      <td>(6.907, 30.25]</td>\n",
       "      <td>(10.0, 20.0]</td>\n",
       "      <td>(6.999, 20.25]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>22</td>\n",
       "      <td>(6.907, 30.25]</td>\n",
       "      <td>(20.0, 30.0]</td>\n",
       "      <td>(20.25, 26.0]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>25</td>\n",
       "      <td>(6.907, 30.25]</td>\n",
       "      <td>(20.0, 30.0]</td>\n",
       "      <td>(20.25, 26.0]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>27</td>\n",
       "      <td>(6.907, 30.25]</td>\n",
       "      <td>(20.0, 30.0]</td>\n",
       "      <td>(26.0, 42.0]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>7</td>\n",
       "      <td>(6.907, 30.25]</td>\n",
       "      <td>(-0.001, 10.0]</td>\n",
       "      <td>(6.999, 20.25]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>21</td>\n",
       "      <td>(6.907, 30.25]</td>\n",
       "      <td>(20.0, 30.0]</td>\n",
       "      <td>(20.25, 26.0]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>23</td>\n",
       "      <td>(6.907, 30.25]</td>\n",
       "      <td>(20.0, 30.0]</td>\n",
       "      <td>(20.25, 26.0]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>37</td>\n",
       "      <td>(30.25, 53.5]</td>\n",
       "      <td>(30.0, 50.0]</td>\n",
       "      <td>(26.0, 42.0]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>30</td>\n",
       "      <td>(6.907, 30.25]</td>\n",
       "      <td>(20.0, 30.0]</td>\n",
       "      <td>(26.0, 42.0]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>61</td>\n",
       "      <td>(53.5, 76.75]</td>\n",
       "      <td>(50.0, 100.0]</td>\n",
       "      <td>(42.0, 100.0]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>45</td>\n",
       "      <td>(30.25, 53.5]</td>\n",
       "      <td>(30.0, 50.0]</td>\n",
       "      <td>(42.0, 100.0]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>41</td>\n",
       "      <td>(30.25, 53.5]</td>\n",
       "      <td>(30.0, 50.0]</td>\n",
       "      <td>(26.0, 42.0]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>9</td>\n",
       "      <td>(6.907, 30.25]</td>\n",
       "      <td>(-0.001, 10.0]</td>\n",
       "      <td>(6.999, 20.25]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>18</td>\n",
       "      <td>(6.907, 30.25]</td>\n",
       "      <td>(10.0, 20.0]</td>\n",
       "      <td>(6.999, 20.25]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>80</td>\n",
       "      <td>(76.75, 100.0]</td>\n",
       "      <td>(50.0, 100.0]</td>\n",
       "      <td>(42.0, 100.0]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>100</td>\n",
       "      <td>(76.75, 100.0]</td>\n",
       "      <td>(50.0, 100.0]</td>\n",
       "      <td>(42.0, 100.0]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    age equal_width_age customized_age_grp  equal_freq_age\n",
       "0    18  (6.907, 30.25]       (10.0, 20.0]  (6.999, 20.25]\n",
       "1    22  (6.907, 30.25]       (20.0, 30.0]   (20.25, 26.0]\n",
       "2    25  (6.907, 30.25]       (20.0, 30.0]   (20.25, 26.0]\n",
       "3    27  (6.907, 30.25]       (20.0, 30.0]    (26.0, 42.0]\n",
       "4     7  (6.907, 30.25]     (-0.001, 10.0]  (6.999, 20.25]\n",
       "5    21  (6.907, 30.25]       (20.0, 30.0]   (20.25, 26.0]\n",
       "6    23  (6.907, 30.25]       (20.0, 30.0]   (20.25, 26.0]\n",
       "7    37   (30.25, 53.5]       (30.0, 50.0]    (26.0, 42.0]\n",
       "8    30  (6.907, 30.25]       (20.0, 30.0]    (26.0, 42.0]\n",
       "9    61   (53.5, 76.75]      (50.0, 100.0]   (42.0, 100.0]\n",
       "10   45   (30.25, 53.5]       (30.0, 50.0]   (42.0, 100.0]\n",
       "11   41   (30.25, 53.5]       (30.0, 50.0]    (26.0, 42.0]\n",
       "12    9  (6.907, 30.25]     (-0.001, 10.0]  (6.999, 20.25]\n",
       "13   18  (6.907, 30.25]       (10.0, 20.0]  (6.999, 20.25]\n",
       "14   80  (76.75, 100.0]      (50.0, 100.0]   (42.0, 100.0]\n",
       "15  100  (76.75, 100.0]      (50.0, 100.0]   (42.0, 100.0]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}