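# [Learning Objectives]
- The code below demonstrates how to use pandas.cut and pandas.qcut in Python to compute discretized labels for numeric data

# [Key Points]
- Equal-width binning with pandas.cut (In[3], Out[4])
- Equal-frequency binning with pandas.qcut (In[5], Out[6])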

In [1]:
# Load packages
import os
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Initialize the ages data
ages = pd.DataFrame({"age": [18,22,25,27,7,21,23,37,30,61,45,41,9,18,80,100]})

#### Equal-width binning

In [3]:
# Add a column "equal_width_age" that bins age into 4 equal-width intervals
ages["equal_width_age"] = pd.cut(ages["age"], 4)

In [4]:
# Count how many rows fall into each equal-width bin
ages["equal_width_age"].value_counts() # every bin spans the same width of values

(6.907, 30.25]    10
(30.25, 53.5]      3
(76.75, 100.0]     2
(53.5, 76.75]      1
Name: equal_width_age, dtype: int64
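
To see where the edges in the output above come from, the sketch below asks `pd.cut` to return the edges it computed via `retbins=True`. It rebuilds the same ages as a standalone Series, so the variable names here are illustrative rather than part of the notebook.

```python
import numpy as np
import pandas as pd

# Same ages as in the notebook, rebuilt as a standalone Series
ages = pd.Series(
    [18, 22, 25, 27, 7, 21, 23, 37, 30, 61, 45, 41, 9, 18, 80, 100], name="age"
)

# retbins=True also returns the bin edges that pd.cut computed
binned, edges = pd.cut(ages, 4, retbins=True)

# Width of each bin: the difference between consecutive edges
widths = np.diff(edges)

print(edges)
print(widths)
print(binned.value_counts().sort_index())
```

The first bin comes out slightly wider than the others because pandas lowers the leftmost edge a little so that the minimum value (7) lands inside the first interval. The remaining bins share the same width.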

#### Equal-frequency binning

In [5]:
# Add a column "equal_freq_age" that bins age into 4 equal-frequency intervals (quartiles)
ages["equal_freq_age"] = pd.qcut(ages["age"], 4)

In [6]:
# Count how many rows fall into each equal-frequency bin
ages["equal_freq_age"].value_counts() # every bin holds (roughly) the same number of rows

(42.0, 100.0]     4
(26.0, 42.0]      4
(20.25, 26.0]     4
(6.999, 20.25]    4
Name: equal_freq_age, dtype: int64
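
The edges chosen by `pd.qcut` in the output above are the quartiles of the data. A minimal sketch that checks this, assuming the same ages rebuilt as a standalone Series (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Same ages as in the notebook, rebuilt as a standalone Series
ages = pd.Series(
    [18, 22, 25, 27, 7, 21, 23, 37, 30, 61, 45, 41, 9, 18, 80, 100], name="age"
)

# retbins=True also returns the edges that pd.qcut used
binned, edges = pd.qcut(ages, 4, retbins=True)

# Quartiles computed directly, for comparison with the edges
quartiles = ages.quantile([0, 0.25, 0.5, 0.75, 1.0])

print(edges)
print(quartiles.to_numpy())
print(binned.value_counts().sort_index())
```

With 16 distinct-enough values the four bins hold four rows each. When many values are tied, the counts may end up only approximately equal.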

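### Homework
Add a new column `customized_age_grp` that splits `age` into five groups: (0, 10], (10, 20], (20, 30], (30, 50], (50, 100]. '(' means the endpoint is excluded and ']' means it is included.

Hint: run `??pd.cut` (without parentheses) and read how the `bins` parameter is used.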

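# [Homework Objectives]
- Look up which other parameters pandas.cut accepts, and change them to reach the goal
- By looking up and changing parameters, become familiar with how to inspect a function and what its parameters mean, and learn the tools available for tuning numeric discretization

# [Homework Key Points]
- Follow the syntax of In[3] and In[4], and set the parameters of pd.cut to specify the bin edges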
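
Outside IPython, where the `?` help syntax is not available, the standard-library `inspect` module is one way to list the parameters of `pd.cut`. This is a small sketch, not the only way to do it (`help(pd.cut)` prints the full docstring instead).

```python
import inspect

import pandas as pd

# Read the signature of pd.cut and list each parameter with its default value
signature = inspect.signature(pd.cut)
param_names = list(signature.parameters)

for name, param in signature.parameters.items():
    if param.default is inspect.Parameter.empty:
        default = "(required)"
    else:
        default = repr(param.default)
    print(f"{name}: {default}")
```

The parameters most relevant to this homework are `bins`, `right`, `labels`, and `include_lowest`.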

In [7]:
# Load packages
import os
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

In [8]:
# Initialize the ages data
ages = pd.DataFrame({"age": [18,22,25,27,7,21,23,37,30,61,45,41,9,18,80,100]})

#### Equal-width binning

In [9]:
# Add a column "equal_width_age" that bins age into 4 equal-width intervals
ages["equal_width_age"] = pd.cut(ages["age"], 4)

Add a new column customized_age_grp that splits age into five groups: (0, 10], (10, 20], (20, 30], (30, 50], (50, 100]. '(' means the endpoint is excluded and ']' means it is included.

[pandas.cut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html?highlight=cut)

In [10]:
# Pass explicit bin edges; the intervals are right-closed by default
ages["customized_age_grp"] = pd.cut(ages["age"], bins=[0, 10, 20, 30, 50, 100])
ages["customized_age_grp"].value_counts()

(20, 30]     6
(50, 100]    3
(30, 50]     3
(10, 20]     2
(0, 10]      2
Name: customized_age_grp, dtype: int64
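
The homework asks for intervals that exclude the left endpoint and include the right one, which is what `pd.cut` does by default (`right=True`). The sketch below contrasts that with `right=False` on the same ages, rebuilt here as a standalone Series with illustrative variable names.

```python
import pandas as pd

# Same ages as in the notebook, rebuilt as a standalone Series
ages = pd.Series([18, 22, 25, 27, 7, 21, 23, 37, 30, 61, 45, 41, 9, 18, 80, 100])
bins = [0, 10, 20, 30, 50, 100]

# Default: right-closed intervals (0, 10], (10, 20], ...
right_closed = pd.cut(ages, bins=bins)

# right=False: left-closed intervals [0, 10), [10, 20), ...
left_closed = pd.cut(ages, bins=bins, right=False)

comparison = pd.DataFrame(
    {"age": ages, "right_closed": right_closed, "left_closed": left_closed}
)

# Rows 8 and 15 hold ages 30 and 100, which sit exactly on a bin edge
print(comparison.iloc[[8, 15]])
```

With `right=False`, age 30 moves into the next group and age 100 falls outside every interval, so it becomes NaN. That is why the default setting fits the groups requested here.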

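By looking up and changing parameters, become familiar with how to inspect a function and what its parameters mean, and learn the tools available for tuning numeric discretization.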

In [11]:
# labels=False returns the integer position of each bin (0 to 4) instead of interval labels
ages["customized_age_grp"] = pd.cut(ages["age"], bins=[0, 10, 20, 30, 50, 100], labels=False)
ages["customized_age_grp"]

0     1
1     2
2     2
3     2
4     0
5     2
6     2
7     3
8     2
9     4
10    3
11    3
12    0
13    1
14    4
15    4
Name: customized_age_grp, dtype: int64

In [12]:
ages["customized_age_grp"].value_counts()

2    6
4    3
3    3
1    2
0    2
Name: customized_age_grp, dtype: int64
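
Besides `labels=False`, the `labels` parameter also accepts a list of names, one per bin. The sketch below uses hypothetical label strings (they are not part of the original homework) and checks that they line up with the integer codes shown above.

```python
import pandas as pd

# Same ages as in the notebook, rebuilt as a standalone Series
ages = pd.Series([18, 22, 25, 27, 7, 21, 23, 37, 30, 61, 45, 41, 9, 18, 80, 100])
bins = [0, 10, 20, 30, 50, 100]

# Hypothetical names for the five groups, chosen only for illustration
group_names = ["0-10", "11-20", "21-30", "31-50", "51-100"]

# One label per bin: the result is a categorical column with readable names
named = pd.cut(ages, bins=bins, labels=group_names)

# labels=False: the result is the integer position of each bin
codes = pd.cut(ages, bins=bins, labels=False)

print(pd.DataFrame({"age": ages, "named": named, "code": codes}).head())
```

The category codes of the named result match the integers returned by `labels=False`, so the two forms carry the same grouping.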

In [13]:
# include_lowest=False (the default): the leftmost edge 0 is not part of the first bin
ages["customized_age_grp"] = pd.cut(ages["age"], bins=[0, 10, 20, 30, 50, 100], include_lowest=False)
ages["customized_age_grp"]

0      (10, 20]
1      (20, 30]
2      (20, 30]
3      (20, 30]
4       (0, 10]
5      (20, 30]
6      (20, 30]
7      (30, 50]
8      (20, 30]
9     (50, 100]
10     (30, 50]
11     (30, 50]
12      (0, 10]
13     (10, 20]
14    (50, 100]
15    (50, 100]
Name: customized_age_grp, dtype: category
Categories (5, interval[int64]): [(0, 10] < (10, 20] < (20, 30] < (30, 50] < (50, 100]]

In [14]:
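# include_lowest=True: the first bin also contains its left edge, so it is displayed as (-0.001, 10.0]
ages["customized_age_grp"] = pd.cut(ages["age"], bins=[0, 10, 20, 30, 50, 100], include_lowest=True)
ages["customized_age_grp"].value_counts()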

(20.0, 30.0]      6
(50.0, 100.0]     3
(30.0, 50.0]      3
(10.0, 20.0]      2
(-0.001, 10.0]    2
Name: customized_age_grp, dtype: int64
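
The counts above are the same with and without `include_lowest` because the smallest age in the data is 7, well above the first edge. The effect only shows up for a value exactly equal to the first edge. A small sketch with a made-up sample (the values 0, 5, and 10 are an assumption for illustration, not from the dataset):

```python
import pandas as pd

# Made-up sample containing a value exactly on the first edge (0)
sample = pd.Series([0, 5, 10])
bins = [0, 10, 20, 30, 50, 100]

# Default: the first interval is (0, 10], so 0 is not assigned to any bin
without_lowest = pd.cut(sample, bins=bins)

# include_lowest=True: the first interval also contains 0
with_lowest = pd.cut(sample, bins=bins, include_lowest=True)

print(pd.DataFrame({"value": sample, "default": without_lowest, "include_lowest": with_lowest}))
```

Without `include_lowest`, the value 0 becomes NaN. With it, all three values land in the first bin.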

In [15]:
# Count how many rows fall into each equal-width bin
ages["equal_width_age"].value_counts() # every bin spans the same width of values

(6.907, 30.25]    10
(30.25, 53.5]      3
(76.75, 100.0]     2
(53.5, 76.75]      1
Name: equal_width_age, dtype: int64

#### Equal-frequency binning

In [16]:
# Add a column "equal_freq_age" that bins age into 4 equal-frequency intervals (quartiles)
ages["equal_freq_age"] = pd.qcut(ages["age"], 4)

In [17]:
# Count how many rows fall into each equal-frequency bin
ages["equal_freq_age"].value_counts() # every bin holds (roughly) the same number of rows

(42.0, 100.0]     4
(26.0, 42.0]      4
(20.25, 26.0]     4
(6.999, 20.25]    4
Name: equal_freq_age, dtype: int64

In [18]:
ages

|   | age | equal_width_age | customized_age_grp | equal_freq_age |
|---|---|---|---|---|
| 0 | 18 | (6.907, 30.25] | (10.0, 20.0] | (6.999, 20.25] |
| 1 | 22 | (6.907, 30.25] | (20.0, 30.0] | (20.25, 26.0] |
| 2 | 25 | (6.907, 30.25] | (20.0, 30.0] | (20.25, 26.0] |
| 3 | 27 | (6.907, 30.25] | (20.0, 30.0] | (26.0, 42.0] |
| 4 | 7 | (6.907, 30.25] | (-0.001, 10.0] | (6.999, 20.25] |
| 5 | 21 | (6.907, 30.25] | (20.0, 30.0] | (20.25, 26.0] |
| 6 | 23 | (6.907, 30.25] | (20.0, 30.0] | (20.25, 26.0] |
| 7 | 37 | (30.25, 53.5] | (30.0, 50.0] | (26.0, 42.0] |
| 8 | 30 | (6.907, 30.25] | (20.0, 30.0] | (26.0, 42.0] |
| 9 | 61 | (53.5, 76.75] | (50.0, 100.0] | (42.0, 100.0] |
