{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# [Learning objectives]\n", "- The code below shows how to use pandas.cut and pandas.qcut in Python to compute discretized (binned) labels for numeric data" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# [Example highlights]\n", "- Equal-width binning with pandas.cut (In[3], Out[4])\n", "- Equal-frequency binning with pandas.qcut (In[5], Out[6])" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Load packages\n", "import os\n", "import numpy as np\n", "import pandas as pd\n", "\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Set up the ages data\n", "ages = pd.DataFrame({\"age\": [18,22,25,27,7,21,23,37,30,61,45,41,9,18,80,100]})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Equal-width binning" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Add column \"equal_width_age\": equal-width binning of age into 4 bins\n", "ages[\"equal_width_age\"] = pd.cut(ages[\"age\"], 4)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(6.907, 30.25] 10\n", "(30.25, 53.5] 3\n", "(76.75, 100.0] 2\n", "(53.5, 76.75] 1\n", "Name: equal_width_age, dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Count how many ages fall into each equal-width bin\n", "ages[\"equal_width_age\"].value_counts() # every bin has the same width; pd.cut only nudges the lowest edge down so the minimum is included" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Equal-frequency binning" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Add column \"equal_freq_age\": equal-frequency (quartile) binning of age\n", "ages[\"equal_freq_age\"] = pd.qcut(ages[\"age\"], 4)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(42.0, 100.0] 4\n", "(26.0, 42.0] 4\n", "(20.25, 26.0] 4\n", "(6.999, 20.25] 4\n", "Name: equal_freq_age, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "# Count how many ages fall into each equal-frequency bin\n", "ages[\"equal_freq_age\"].value_counts() # each bin holds the same number of rows (4 here)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Homework\n", "Add a new column `customized_age_grp` that splits `age` into five groups: (0, 10], (10, 20], (20, 30], (30, 50], (50, 100], where '(' excludes the endpoint and ']' includes it\n", "\n", "Hint: run ??pd.cut to see how the bins parameter is used" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# [Homework objectives]\n", "- Look up the other parameters that pandas.cut accepts, and change them to reach the target binning\n", "- By looking up and changing parameters, practice reading a function's documentation, understand what each parameter does, and get to know the tools for tuning numeric discretization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# [Homework highlights]\n", "- Follow the syntax of In[3] and In[4], and set the parameters of pd.cut to specify the bin edges" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# Load packages\n", "import os\n", "import numpy as np\n", "import pandas as pd\n", "\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Set up the ages data\n", "ages = pd.DataFrame({\"age\": [18,22,25,27,7,21,23,37,30,61,45,41,9,18,80,100]})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Equal-width binning" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Add column \"equal_width_age\": equal-width binning of age into 4 bins\n", "ages[\"equal_width_age\"] = pd.cut(ages[\"age\"], 4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Add a new column customized_age_grp that splits age into five groups: (0, 10], (10, 20], (20, 30], (30, 50], (50, 100], where '(' excludes the endpoint and ']' includes it\n", "\n", "[Pandas.cut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html?highlight=cut)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { 
"text/plain": [ "(20, 30] 6\n", "(50, 100] 3\n", "(30, 50] 3\n", "(10, 20] 2\n", "(0, 10] 2\n", "Name: customized_age_grp, dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Bin age with explicit edges; each interval is right-closed by default\n", "ages[\"customized_age_grp\"] = pd.cut(ages[\"age\"], bins=[0, 10, 20, 30, 50, 100])\n", "ages[\"customized_age_grp\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By looking up and changing parameters, practice reading a function's documentation, understand what each parameter does, and get to know the tools for tuning numeric discretization" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 2\n", "3 2\n", "4 0\n", "5 2\n", "6 2\n", "7 3\n", "8 2\n", "9 4\n", "10 3\n", "11 3\n", "12 0\n", "13 1\n", "14 4\n", "15 4\n", "Name: customized_age_grp, dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# labels=False returns integer bin codes (0 to 4) instead of interval labels\n", "ages[\"customized_age_grp\"] = pd.cut(ages[\"age\"], bins=[0, 10, 20, 30, 50, 100], labels=False)\n", "ages[\"customized_age_grp\"]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2 6\n", "4 3\n", "3 3\n", "1 2\n", "0 2\n", "Name: customized_age_grp, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ages[\"customized_age_grp\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 (10, 20]\n", "1 (20, 30]\n", "2 (20, 30]\n", "3 (20, 30]\n", "4 (0, 10]\n", "5 (20, 30]\n", "6 (20, 30]\n", "7 (30, 50]\n", "8 (20, 30]\n", "9 (50, 100]\n", "10 (30, 50]\n", "11 (30, 50]\n", "12 (0, 10]\n", "13 (10, 20]\n", "14 (50, 100]\n", "15 (50, 100]\n", "Name: customized_age_grp, dtype: category\n", "Categories (5, interval[int64]): [(0, 10] < (10, 20] < (20, 30] < (30, 50] < (50, 100]]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# include_lowest=False (the default) leaves the first interval open on the left\n", "ages[\"customized_age_grp\"] = 
pd.cut(ages[\"age\"], bins=[0, 10, 20, 30, 50, 100], include_lowest=False)\n", "ages[\"customized_age_grp\"]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(20.0, 30.0] 6\n", "(50.0, 100.0] 3\n", "(30.0, 50.0] 3\n", "(10.0, 20.0] 2\n", "(-0.001, 10.0] 2\n", "Name: customized_age_grp, dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# include_lowest=True makes the first interval include its left edge, displayed as (-0.001, 10.0]\n", "ages[\"customized_age_grp\"] = pd.cut(ages[\"age\"], bins=[0, 10, 20, 30, 50, 100], include_lowest=True)\n", "ages[\"customized_age_grp\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(6.907, 30.25] 10\n", "(30.25, 53.5] 3\n", "(76.75, 100.0] 2\n", "(53.5, 76.75] 1\n", "Name: equal_width_age, dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Count how many ages fall into each equal-width bin\n", "ages[\"equal_width_age\"].value_counts() # every bin has the same width; pd.cut only nudges the lowest edge down so the minimum is included" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Equal-frequency binning" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Add column \"equal_freq_age\": equal-frequency (quartile) binning of age\n", "ages[\"equal_freq_age\"] = pd.qcut(ages[\"age\"], 4)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(42.0, 100.0] 4\n", "(26.0, 42.0] 4\n", "(20.25, 26.0] 4\n", "(6.999, 20.25] 4\n", "Name: equal_freq_age, dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Count how many ages fall into each equal-frequency bin\n", "ages[\"equal_freq_age\"].value_counts() # each bin holds the same number of rows (4 here)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | age | equal_width_age | customized_age_grp | equal_freq_age | \n", "
---|---|---|---|---| \n", "
0 | 18 | (6.907, 30.25] | (10.0, 20.0] | (6.999, 20.25] | \n", "
1 | 22 | (6.907, 30.25] | (20.0, 30.0] | (20.25, 26.0] | \n", "
2 | 25 | (6.907, 30.25] | (20.0, 30.0] | (20.25, 26.0] | \n", "
3 | 27 | (6.907, 30.25] | (20.0, 30.0] | (26.0, 42.0] | \n", "
4 | 7 | (6.907, 30.25] | (-0.001, 10.0] | (6.999, 20.25] | \n", "
5 | 21 | (6.907, 30.25] | (20.0, 30.0] | (20.25, 26.0] | \n", "
6 | 23 | (6.907, 30.25] | (20.0, 30.0] | (20.25, 26.0] | \n", "
7 | 37 | (30.25, 53.5] | (30.0, 50.0] | (26.0, 42.0] | \n", "
8 | 30 | (6.907, 30.25] | (20.0, 30.0] | (26.0, 42.0] | \n", "
9 | 61 | (53.5, 76.75] | (50.0, 100.0] | (42.0, 100.0] | \n", "
10 | 45 | (30.25, 53.5] | (30.0, 50.0] | (42.0, 100.0] | \n", "
11 | 41 | (30.25, 53.5] | (30.0, 50.0] | (26.0, 42.0] | \n", "
12 | 9 | (6.907, 30.25] | (-0.001, 10.0] | (6.999, 20.25] | \n", "
13 | 18 | (6.907, 30.25] | (10.0, 20.0] | (6.999, 20.25] | \n", "
14 | 80 | (76.75, 100.0] | (50.0, 100.0] | (42.0, 100.0] | \n", "
15 | 100 | (76.75, 100.0] | (50.0, 100.0] | (42.0, 100.0] | \n", "