{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# [Learning Objectives]\n", "- The code below demonstrates how to use pandas.cut and pandas.qcut in Python to compute discretized labels for numeric data" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# [Key Points of the Example]\n", "- Equal-width binning with pandas.cut (In[3], Out[4])\n", "- Equal-frequency binning with pandas.qcut (In[5], Out[6])" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Load packages\n", "import os\n", "import numpy as np\n", "import pandas as pd\n", "\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Set up the initial ages data\n", "ages = pd.DataFrame({\"age\": [18,22,25,27,7,21,23,37,30,61,45,41,9,18,80,100]})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Equal-width binning" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Add a column \"equal_width_age\" that splits age into 4 equal-width bins\n", "ages[\"equal_width_age\"] = pd.cut(ages[\"age\"], 4)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(6.907, 30.25] 10\n", "(30.25, 53.5] 3\n", "(76.75, 100.0] 2\n", "(53.5, 76.75] 1\n", "Name: equal_width_age, dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Count how many ages fall into each equal-width bin\n", "ages[\"equal_width_age\"].value_counts() # every bin spans the same range of values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Equal-frequency binning" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Add a column \"equal_freq_age\" that splits age into 4 equal-frequency bins\n", "ages[\"equal_freq_age\"] = pd.qcut(ages[\"age\"], 4)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(42.0, 100.0] 4\n", "(26.0, 42.0] 4\n", "(20.25, 26.0] 4\n", "(6.999, 20.25] 4\n", "Name: equal_freq_age, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "# 觀察等頻劃分下, 每個種組距各出現幾次\n", "ages[\"equal_freq_age\"].value_counts() # 每個 bin 的資料筆數是一樣的" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 作業\n", "新增一個欄位 `customized_age_grp`,把 `age` 分為 (0, 10], (10, 20], (20, 30], (30, 50], (50, 100] 這五組,'(' 表示不包含, ']' 表示包含\n", "\n", "Hints: 執行 ??pd.cut(),了解提供其中 bins 這個參數的使用方式" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# 作業\n", "- 新增一個欄位 `customized_age_grp`,把 `age` 分為 (0, 10], (10, 20], (20, 30], (30, 50], (50, 100] 這五組,\n", "'(' 表示不包含, ']' 表示包含 \n", "- Hints: 執行 ??pd.cut(),了解提供其中 bins 這個參數的使用方式" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# [作業目標]\n", "- 請同學試著查詢 pandas.cut 這個函數還有哪些參數, 藉由改動參數以達成目標\n", "- 藉由查詢與改動參數的過程, 熟悉查詢函數的方法與理解參數性質, 並了解數值的離散化的調整工具" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# [作業重點]\n", "- 仿照 In[3], In[4] 的語法, 並設定 pd.cut 的參數以指定間距" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# 載入套件\n", "import os\n", "import numpy as np\n", "import pandas as pd\n", "\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# 初始設定 Ages 的資料\n", "ages = pd.DataFrame({\"age\": [18,22,25,27,7,21,23,37,30,61,45,41,9,18,80,100]})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 等寬劃分" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# 新增欄位 \"equal_width_age\", 對年齡做等寬劃分\n", "ages[\"equal_width_age\"] = pd.cut(ages[\"age\"], 4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "新增一個欄位 customized_age_grp,把 age 分為 (0, 10], (10, 20], (20, 30], (30, 50], (50, 100] 這五組, '(' 表示不包含, ']' 表示包含\n", "\n", "
[Pandas.cut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html?highlight=cut)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(20, 30] 6\n", "(50, 100] 3\n", "(30, 50] 3\n", "(10, 20] 2\n", "(0, 10] 2\n", "Name: customized_age_grp, dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Bin age with explicit edges; each interval is closed on the right by default\n", "ages[\"customized_age_grp\"] = pd.cut(ages[\"age\"], bins=[0, 10, 20, 30, 50, 100])\n", "ages[\"customized_age_grp\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking up and changing parameters is a way to get familiar with how to look up a function, understand what its parameters do, and learn the tools for tuning numeric discretization" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 2\n", "3 2\n", "4 0\n", "5 2\n", "6 2\n", "7 3\n", "8 2\n", "9 4\n", "10 3\n", "11 3\n", "12 0\n", "13 1\n", "14 4\n", "15 4\n", "Name: customized_age_grp, dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# labels=False returns the integer index of each bin (0 to 4) instead of interval labels\n", "ages[\"customized_age_grp\"] = pd.cut(ages[\"age\"], bins=[0, 10, 20, 30, 50, 100], labels=False)\n", "ages[\"customized_age_grp\"]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2 6\n", "4 3\n", "3 3\n", "1 2\n", "0 2\n", "Name: customized_age_grp, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Count how many ages fall into each bin index\n", "ages[\"customized_age_grp\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 (10, 20]\n", "1 (20, 30]\n", "2 (20, 30]\n", "3 (20, 30]\n", "4 (0, 10]\n", "5 (20, 30]\n", "6 (20, 30]\n", "7 (30, 50]\n", "8 (20, 30]\n", "9 (50, 100]\n", "10 (30, 50]\n", "11 (30, 50]\n", "12 (0, 10]\n", "13 (10, 20]\n", "14 (50, 100]\n", "15 (50, 100]\n", "Name: customized_age_grp, dtype: category\n", "Categories (5, interval[int64]): [(0, 10] < (10, 20] < (20, 30] 
< (30, 50] < (50, 100]]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# include_lowest=False (the default) leaves the first interval open on the left: (0, 10]\n", "ages[\"customized_age_grp\"] = pd.cut(ages[\"age\"], bins=[0, 10, 20, 30, 50, 100], include_lowest=False)\n", "ages[\"customized_age_grp\"]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(20.0, 30.0] 6\n", "(50.0, 100.0] 3\n", "(30.0, 50.0] 3\n", "(10.0, 20.0] 2\n", "(-0.001, 10.0] 2\n", "Name: customized_age_grp, dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# include_lowest=True makes the first interval include its left edge 0, shown in the output as (-0.001, 10.0]\n", "ages[\"customized_age_grp\"] = pd.cut(ages[\"age\"], bins=[0, 10, 20, 30, 50, 100], include_lowest=True)\n", "ages[\"customized_age_grp\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(6.907, 30.25] 10\n", "(30.25, 53.5] 3\n", "(76.75, 100.0] 2\n", "(53.5, 76.75] 1\n", "Name: equal_width_age, dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Count how many ages fall into each equal-width bin\n", "ages[\"equal_width_age\"].value_counts() # every bin spans the same range of values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Equal-frequency binning" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Add a column \"equal_freq_age\" that splits age into 4 equal-frequency bins\n", "ages[\"equal_freq_age\"] = pd.qcut(ages[\"age\"], 4)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(42.0, 100.0] 4\n", "(26.0, 42.0] 4\n", "(20.25, 26.0] 4\n", "(6.999, 20.25] 4\n", "Name: equal_freq_age, dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Count how many ages fall into each equal-frequency bin\n", "ages[\"equal_freq_age\"].value_counts() # every bin holds the same number of rows" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 "  <thead>\n",
 "    <tr style=\"text-align: right;\"><th></th><th>age</th><th>equal_width_age</th><th>customized_age_grp</th><th>equal_freq_age</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>0</th><td>18</td><td>(6.907, 30.25]</td><td>(10.0, 20.0]</td><td>(6.999, 20.25]</td></tr>\n",
 "    <tr><th>1</th><td>22</td><td>(6.907, 30.25]</td><td>(20.0, 30.0]</td><td>(20.25, 26.0]</td></tr>\n",
 "    <tr><th>2</th><td>25</td><td>(6.907, 30.25]</td><td>(20.0, 30.0]</td><td>(20.25, 26.0]</td></tr>\n",
 "    <tr><th>3</th><td>27</td><td>(6.907, 30.25]</td><td>(20.0, 30.0]</td><td>(26.0, 42.0]</td></tr>\n",
 "    <tr><th>4</th><td>7</td><td>(6.907, 30.25]</td><td>(-0.001, 10.0]</td><td>(6.999, 20.25]</td></tr>\n",
 "    <tr><th>5</th><td>21</td><td>(6.907, 30.25]</td><td>(20.0, 30.0]</td><td>(20.25, 26.0]</td></tr>\n",
 "    <tr><th>6</th><td>23</td><td>(6.907, 30.25]</td><td>(20.0, 30.0]</td><td>(20.25, 26.0]</td></tr>\n",
 "    <tr><th>7</th><td>37</td><td>(30.25, 53.5]</td><td>(30.0, 50.0]</td><td>(26.0, 42.0]</td></tr>\n",
 "    <tr><th>8</th><td>30</td><td>(6.907, 30.25]</td><td>(20.0, 30.0]</td><td>(26.0, 42.0]</td></tr>\n",
 "    <tr><th>9</th><td>61</td><td>(53.5, 76.75]</td><td>(50.0, 100.0]</td><td>(42.0, 100.0]</td></tr>\n",
 "    <tr><th>10</th><td>45</td><td>(30.25, 53.5]</td><td>(30.0, 50.0]</td><td>(42.0, 100.0]</td></tr>\n",
 "    <tr><th>11</th><td>41</td><td>(30.25, 53.5]</td><td>(30.0, 50.0]</td><td>(26.0, 42.0]</td></tr>\n",
 "    <tr><th>12</th><td>9</td><td>(6.907, 30.25]</td><td>(-0.001, 10.0]</td><td>(6.999, 20.25]</td></tr>\n",
 "    <tr><th>13</th><td>18</td><td>(6.907, 30.25]</td><td>(10.0, 20.0]</td><td>(6.999, 20.25]</td></tr>\n",
 "    <tr><th>14</th><td>80</td><td>(76.75, 100.0]</td><td>(50.0, 100.0]</td><td>(42.0, 100.0]</td></tr>\n",
 "    <tr><th>15</th><td>100</td><td>(76.75, 100.0]</td><td>(50.0, 100.0]</td><td>(42.0, 100.0]</td></tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "</div>
" ], "text/plain": [ " age equal_width_age customized_age_grp equal_freq_age\n", "0 18 (6.907, 30.25] (10.0, 20.0] (6.999, 20.25]\n", "1 22 (6.907, 30.25] (20.0, 30.0] (20.25, 26.0]\n", "2 25 (6.907, 30.25] (20.0, 30.0] (20.25, 26.0]\n", "3 27 (6.907, 30.25] (20.0, 30.0] (26.0, 42.0]\n", "4 7 (6.907, 30.25] (-0.001, 10.0] (6.999, 20.25]\n", "5 21 (6.907, 30.25] (20.0, 30.0] (20.25, 26.0]\n", "6 23 (6.907, 30.25] (20.0, 30.0] (20.25, 26.0]\n", "7 37 (30.25, 53.5] (30.0, 50.0] (26.0, 42.0]\n", "8 30 (6.907, 30.25] (20.0, 30.0] (26.0, 42.0]\n", "9 61 (53.5, 76.75] (50.0, 100.0] (42.0, 100.0]\n", "10 45 (30.25, 53.5] (30.0, 50.0] (42.0, 100.0]\n", "11 41 (30.25, 53.5] (30.0, 50.0] (26.0, 42.0]\n", "12 9 (6.907, 30.25] (-0.001, 10.0] (6.999, 20.25]\n", "13 18 (6.907, 30.25] (10.0, 20.0] (6.999, 20.25]\n", "14 80 (76.75, 100.0] (50.0, 100.0] (42.0, 100.0]\n", "15 100 (76.75, 100.0] (50.0, 100.0] (42.0, 100.0]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ages" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }