{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "<p style=\"text-align:center\">\n", " <a href=\"https://nbviewer.jupyter.org/github/twMr7/Python-Machine-Learning/blob/master/13-Pandas_Data_Processing.ipynb\">\n", " Open In Jupyter nbviewer\n", " <img style=\"float: center;\" src=\"https://nbviewer.jupyter.org/static/img/nav_logo.svg\" width=\"120\" />\n", " </a>\n", "</p>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[](https://colab.research.google.com/github/twMr7/Python-Machine-Learning/blob/master/13-Pandas_Data_Processing.ipynb)\n", "\n", "# 13. Pandas 資料處理\n", "\n", "[Pandas](http://pandas.pydata.org/) 也是建構於 Numpy 之上,主要設計的定位是用來資料處理及分析。 與 Numpy 陣列不同的地方在,Pandas 表格式的資料容器 `DataFrame` 可以存放及操作異質資料型態,資料欄位可以有標籤,易於處理 missing data、類別資料、與時間序列資料,而且針對常用的資料儲存格式提供了相當廣泛的輸出入的支援。\n", "\n", "一下教材內容節錄自 [Pandas 官方文件](http://pandas.pydata.org/pandas-docs/stable/index.html)。\n", "\n", "+ [**13.1 Series 與 DataFrame 基本認識**](#basic-datatype)\n", "+ [**13.2 資料內容選取**](#indexing-selecting)\n", "+ [**13.3 新增、刪除與合併**](#append-concat)\n", "+ [**13.4 深入檢視**](#inspecting-data)\n", "+ [**13.5 分群及排序**](#grouping-sorting)\n", "+ [**13.6 漏失數據處理**](#missing-data)\n", "+ [**13.7 時間序列**](#time-series)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### § 使用 `pandas` 套件" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<a id=\"basic-datatype\"></a>\n", "\n", "## 13.1 `Series` 與 `DataFrame` 基本認識\n", "\n", "Pandas 主要的資料結構是 `Series` 與 `DataFrame`。\n", "+ `Series` - 一維,有標籤,同質性(homogeneously-typed)的資料結構。\n", "+ `DataFrame` - 二維,有欄位標籤及列記錄標籤,欄位異質性(heterogeneously-typed )的資料結構。\n", "\n", "標籤就像 Dict 容器的 Key,可以是字串、數值、時間等可以當成 Key 的資料型態。 Series 可以想成是每一個元素位置都帶有標籤的 **row向量** 或 **column向量**,由 DataFrame 取出某個 row 或某個 column 就是一個 Series,取出的 Series 就帶有原本的 **row標籤** 或 **column標籤**。 \n", "\n", "在建立 DataFrame 或 Series 時,若沒有指定標籤,預設會使用位置順序的數值序號。" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 3.0\n", "2 5.0\n", "3 NaN\n", "4 6.0\n", "5 8.0\n", "dtype: float64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Creating a Series by passing a list of values, letting pandas create a default integer index\n", "pd.Series([1, 3, 5, np.nan, 6, 8])" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DatetimeIndex(['2019-04-01', '2019-04-02', '2019-04-03', '2019-04-04',\n", " '2019-04-05', '2019-04-06'],\n", " dtype='datetime64[ns]', freq='D')\n" ] } ], "source": [ "# Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns\n", "dates = pd.date_range('20190401', periods=6)\n", "print(dates)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " <th>D</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2019-04-01</th>\n", " <td>0.461541</td>\n", " <td>-1.216620</td>\n", " <td>0.704009</td>\n", " <td>0.271539</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-02</th>\n", " <td>0.983399</td>\n", " <td>-0.701624</td>\n", " <td>0.894597</td>\n", " <td>0.096956</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-03</th>\n", " <td>-0.294962</td>\n", " <td>-1.077161</td>\n", " <td>-0.834394</td>\n", " <td>-1.162824</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-04</th>\n", " <td>-0.962004</td>\n", " <td>0.157794</td>\n", " <td>-2.070301</td>\n", " <td>0.229414</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-05</th>\n", " <td>-0.178729</td>\n", " <td>-0.639334</td>\n", " <td>-0.724654</td>\n", " <td>0.377905</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-06</th>\n", " <td>1.132095</td>\n", " <td>-0.987462</td>\n", " <td>0.536977</td>\n", " <td>1.953742</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C D\n", "2019-04-01 0.461541 -1.216620 0.704009 0.271539\n", "2019-04-02 0.983399 -0.701624 0.894597 0.096956\n", "2019-04-03 -0.294962 -1.077161 -0.834394 -1.162824\n", "2019-04-04 -0.962004 0.157794 -2.070301 0.229414\n", "2019-04-05 -0.178729 -0.639334 -0.724654 0.377905\n", "2019-04-06 1.132095 -0.987462 0.536977 1.953742" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))\n", "df" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " <th>D</th>\n", " <th>E</th>\n", " <th>F</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1.0</td>\n", " <td>2013-01-02</td>\n", " <td>1.0</td>\n", " <td>3</td>\n", " <td>test</td>\n", " <td>foo</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1.0</td>\n", " <td>2013-01-02</td>\n", " <td>1.0</td>\n", " <td>3</td>\n", " <td>train</td>\n", " <td>foo</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>1.0</td>\n", " <td>2013-01-02</td>\n", " <td>1.0</td>\n", " <td>3</td>\n", " <td>test</td>\n", " <td>foo</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>1.0</td>\n", " <td>2013-01-02</td>\n", " <td>1.0</td>\n", " <td>3</td>\n", " <td>train</td>\n", " <td>foo</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C D E F\n", "0 1.0 2013-01-02 1.0 3 test foo\n", "1 1.0 2013-01-02 1.0 3 train foo\n", "2 1.0 2013-01-02 1.0 3 test foo\n", "3 1.0 2013-01-02 1.0 3 train foo" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Creating a DataFrame by passing a dict of objects that can be converted to series-like.\n", "# 注意: 單一值會自動 broadcast\n", "df2 = pd.DataFrame({'A': 1.,\n", " 'B': pd.Timestamp('20130102'),\n", " 'C': pd.Series(1, index=list(range(4)), dtype='float32'),\n", " 'D': np.array([3] * 4, dtype='int32'),\n", " 'E': pd.Categorical([\"test\", \"train\", \"test\", \"train\"]),\n", " 'F': 'foo'})\n", "df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### § 基本數據檢視\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " <th>D</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2019-04-01</th>\n", " <td>0.461541</td>\n", " <td>-1.216620</td>\n", " <td>0.704009</td>\n", " <td>0.271539</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-02</th>\n", " <td>0.983399</td>\n", " <td>-0.701624</td>\n", " <td>0.894597</td>\n", " <td>0.096956</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-03</th>\n", " <td>-0.294962</td>\n", " <td>-1.077161</td>\n", " <td>-0.834394</td>\n", " <td>-1.162824</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C D\n", "2019-04-01 0.461541 -1.216620 0.704009 0.271539\n", "2019-04-02 0.983399 -0.701624 0.894597 0.096956\n", "2019-04-03 -0.294962 -1.077161 -0.834394 -1.162824" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 檢視前幾筆\n", "df.head(3)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " <th>D</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2019-04-04</th>\n", " <td>-0.962004</td>\n", " <td>0.157794</td>\n", " <td>-2.070301</td>\n", " <td>0.229414</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-05</th>\n", " <td>-0.178729</td>\n", " <td>-0.639334</td>\n", " <td>-0.724654</td>\n", " <td>0.377905</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-06</th>\n", " <td>1.132095</td>\n", " <td>-0.987462</td>\n", " <td>0.536977</td>\n", " <td>1.953742</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C D\n", "2019-04-04 -0.962004 0.157794 -2.070301 0.229414\n", "2019-04-05 -0.178729 -0.639334 -0.724654 0.377905\n", "2019-04-06 1.132095 -0.987462 0.536977 1.953742" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 檢視後幾筆\n", "df.tail(3)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DatetimeIndex(['2019-04-01', '2019-04-02', '2019-04-03', '2019-04-04',\n", " '2019-04-05', '2019-04-06'],\n", " dtype='datetime64[ns]', freq='D')" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 檢視記錄標籤\n", "df.index" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['A', 'B', 'C', 'D'], dtype='object')" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 檢視記錄欄位標籤\n", "df.columns" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A float64\n", "B float64\n", "C float64\n", "D float64\n", "dtype: object" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 檢視資料類型\n", "df.dtypes" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0.4615414 , -1.21661998, 0.70400939, 0.27153872],\n", " [ 0.98339876, -0.70162361, 0.89459705, 0.09695603],\n", " [-0.29496219, -1.07716089, -0.83439422, -1.16282439],\n", " [-0.96200379, 0.15779416, -2.0703014 , 0.22941439],\n", " [-0.17872924, -0.63933357, -0.7246541 , 0.37790513],\n", " [ 1.13209463, -0.98746243, 0.53697748, 1.95374221]])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 資料都是數值的話,轉成 numpy 陣列非常快\n", "# Pandas 版本 < 0.24 要使用 DataFrame.values\n", "df.to_numpy()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A float64\n", "B datetime64[ns]\n", "C float32\n", "D int32\n", "E category\n", "F object\n", "dtype: object" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 檢視資料類型\n", "df2.dtypes" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],\n", " [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],\n", " [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],\n", " [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],\n", " dtype=object)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 異質欄位資料也可以轉成 numpy 陣列的話,但成本較高,通常在 numpy 也不會比較容易處理異質資料\n", "df2.to_numpy()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " <th>D</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>count</th>\n", " <td>6.000000</td>\n", " <td>6.000000</td>\n", " <td>6.000000</td>\n", " <td>6.000000</td>\n", " </tr>\n", " <tr>\n", " <th>mean</th>\n", " <td>0.190223</td>\n", " <td>-0.744068</td>\n", " <td>-0.248961</td>\n", " <td>0.294455</td>\n", " </tr>\n", " <tr>\n", " <th>std</th>\n", " <td>0.811051</td>\n", " <td>0.493887</td>\n", " <td>1.159299</td>\n", " <td>0.992736</td>\n", " </tr>\n", " <tr>\n", " <th>min</th>\n", " <td>-0.962004</td>\n", " <td>-1.216620</td>\n", " <td>-2.070301</td>\n", " <td>-1.162824</td>\n", " </tr>\n", " <tr>\n", " <th>25%</th>\n", " <td>-0.265904</td>\n", " <td>-1.054736</td>\n", " <td>-0.806959</td>\n", " <td>0.130071</td>\n", " </tr>\n", " <tr>\n", " <th>50%</th>\n", " <td>0.141406</td>\n", " <td>-0.844543</td>\n", " <td>-0.093838</td>\n", " <td>0.250477</td>\n", " </tr>\n", " <tr>\n", " <th>75%</th>\n", " <td>0.852934</td>\n", " <td>-0.654906</td>\n", " <td>0.662251</td>\n", " <td>0.351314</td>\n", " </tr>\n", " <tr>\n", " <th>max</th>\n", " <td>1.132095</td>\n", " <td>0.157794</td>\n", " <td>0.894597</td>\n", " <td>1.953742</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C D\n", "count 6.000000 6.000000 6.000000 6.000000\n", "mean 0.190223 -0.744068 -0.248961 0.294455\n", "std 0.811051 0.493887 1.159299 0.992736\n", "min -0.962004 -1.216620 -2.070301 -1.162824\n", "25% -0.265904 -1.054736 -0.806959 0.130071\n", "50% 0.141406 -0.844543 -0.093838 0.250477\n", "75% 0.852934 -0.654906 0.662251 0.351314\n", "max 1.132095 0.157794 0.894597 1.953742" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 基本統計描述(column-wise):平均值、標準差、最小值、第一四分位數、中位數、第二四分位數、最大值\n", "df.describe()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>2019-04-01 00:00:00</th>\n", " <th>2019-04-02 00:00:00</th>\n", " <th>2019-04-03 00:00:00</th>\n", " <th>2019-04-04 00:00:00</th>\n", " <th>2019-04-05 00:00:00</th>\n", " <th>2019-04-06 00:00:00</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>A</th>\n", " <td>0.461541</td>\n", " <td>0.983399</td>\n", " <td>-0.294962</td>\n", " <td>-0.962004</td>\n", " <td>-0.178729</td>\n", " <td>1.132095</td>\n", " </tr>\n", " <tr>\n", " <th>B</th>\n", " <td>-1.216620</td>\n", " <td>-0.701624</td>\n", " <td>-1.077161</td>\n", " <td>0.157794</td>\n", " <td>-0.639334</td>\n", " <td>-0.987462</td>\n", " </tr>\n", " <tr>\n", " <th>C</th>\n", " <td>0.704009</td>\n", " <td>0.894597</td>\n", " <td>-0.834394</td>\n", " <td>-2.070301</td>\n", " <td>-0.724654</td>\n", " <td>0.536977</td>\n", " </tr>\n", " <tr>\n", " <th>D</th>\n", " <td>0.271539</td>\n", " <td>0.096956</td>\n", " <td>-1.162824</td>\n", " <td>0.229414</td>\n", " <td>0.377905</td>\n", " <td>1.953742</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 2019-04-01 2019-04-02 2019-04-03 2019-04-04 2019-04-05 2019-04-06\n", "A 0.461541 0.983399 -0.294962 -0.962004 -0.178729 1.132095\n", "B -1.216620 -0.701624 -1.077161 0.157794 -0.639334 -0.987462\n", "C 0.704009 0.894597 -0.834394 -2.070301 -0.724654 0.536977\n", "D 0.271539 0.096956 -1.162824 0.229414 0.377905 1.953742" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Transpose\n", "df.T" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " <th>D</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2019-04-04</th>\n", " <td>-0.962004</td>\n", " <td>0.157794</td>\n", " <td>-2.070301</td>\n", " <td>0.229414</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-05</th>\n", " <td>-0.178729</td>\n", " <td>-0.639334</td>\n", " <td>-0.724654</td>\n", " <td>0.377905</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-02</th>\n", " <td>0.983399</td>\n", " <td>-0.701624</td>\n", " <td>0.894597</td>\n", " <td>0.096956</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-06</th>\n", " <td>1.132095</td>\n", " <td>-0.987462</td>\n", " <td>0.536977</td>\n", " <td>1.953742</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-03</th>\n", " <td>-0.294962</td>\n", " <td>-1.077161</td>\n", " <td>-0.834394</td>\n", " <td>-1.162824</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-01</th>\n", " <td>0.461541</td>\n", " <td>-1.216620</td>\n", " <td>0.704009</td>\n", " <td>0.271539</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C D\n", "2019-04-04 -0.962004 0.157794 -2.070301 0.229414\n", "2019-04-05 -0.178729 -0.639334 -0.724654 0.377905\n", "2019-04-02 0.983399 -0.701624 0.894597 0.096956\n", "2019-04-06 1.132095 -0.987462 0.536977 1.953742\n", "2019-04-03 -0.294962 -1.077161 -0.834394 -1.162824\n", "2019-04-01 0.461541 -1.216620 0.704009 0.271539" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 根據某欄位排序\n", "df.sort_values(by='B', ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<a id=\"indexing-selecting\"></a>\n", "\n", "## 13.2 資料內容選取\n", "\n", "### § 直接括號操作\n", "\n", "注意: Pandas 可以直接用中括號 [ ] 選取 DataFrame 的資料,由於括號中的標籤可以是欄位標籤也可以是列序號,在容易混淆誤用時,請儘可能使用下一節中提到的 `loc[]` 或 `iloc[]` 語法。\n", "\n", "DataFrame 可以直接使用中括號 [ ] 指定欄位標籤存取欄位資料,主要語法為:\n", "```\n", " DataFrame[欄位標籤], 或\n", " DataFrame[欄位標籤清單]\n", "```\n", "若只需要存取單一欄位,而且 columns 標籤是字串形式,也可以使用以下語法(注意不帶字串的引號):\n", "```\n", " DataFrame.欄位標籤\n", "```\n", "\n", "Series 直接使用中括號 [ ] 的語法為:\n", "```\n", " Series[標籤], 或\n", " Series[標籤清單]\n", "```\n", "若只需要存取序列的單一位置,而且 index 標籤是字串形式,也可以使用以下語法(注意不帶字串的引號):\n", "```\n", " Series.index標籤\n", "```" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2019-04-01 0.461541\n", "2019-04-02 0.983399\n", "2019-04-03 -0.294962\n", "2019-04-04 -0.962004\n", "2019-04-05 -0.178729\n", "2019-04-06 1.132095\n", "Freq: D, Name: A, dtype: float64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 選取單一欄位,返回一個 Series\n", "df['A']" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2019-04-01 0.461541\n", "2019-04-02 0.983399\n", "2019-04-03 -0.294962\n", "2019-04-04 -0.962004\n", "2019-04-05 -0.178729\n", "2019-04-06 1.132095\n", "Freq: D, Name: A, dtype: float64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# df['A'] 語法等同於 df.A\n", "df.A" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2019-04-01 0\n", "2019-04-02 0\n", "2019-04-03 0\n", "2019-04-04 0\n", "2019-04-05 0\n", "2019-04-06 1\n", "Freq: D, Name: A, dtype: int32" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 資料型別轉換\n", "df.A.astype(int)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### § 使用 `loc` 及 `iloc` 操作\n", "\n", "建議使用針對 Pandas 資料結構最佳化過的 `loc[]`、和 `iloc[]` 語法,序號(indexing)及片段(slicing)的語法類似 Numpy 陣列。\n", "\n", "| `loc` 存取方式 | Series | DataFrame |\n", "|------------------|--------------------|-------------------------------|\n", "| 標籤 | `s.loc[標籤]` | `df.loc[標籤, 標籤]` |\n", "| 標籤清單 | `s.loc[標籤清單]` | `df.loc[標籤清單, 標籤清單]` |\n", "| 標籤片段 | `s.loc[標籤片段]` | `df.loc[標籤片段, 標籤片段]` |\n", "| Boolean 遮罩陣列 | `s.loc[遮罩陣列]` | `df.loc[遮罩陣列, 遮罩陣列]` |\n", "\n", "注意片段語法 `loc[start:stop:step]`,返回結果為封閉區間的 [start, stop],**包含**結束的 stop 項。\n", "\n", "`iloc` 主要以位置順序的數值序號為存取方式。\n", "\n", "| `iloc` 存取方式 | Series | DataFrame |\n", "|------------------|--------------------|-------------------------------|\n", "| 序號 | `s.iloc[序號]` | `df.iloc[序號, 序號]` |\n", "| 序號清單 | `s.iloc[序號清單]` | `df.loc[序號清單, 序號清單]` |\n", "| 序號片段 | `s.iloc[序號片段]` | `df.loc[序號片段, 序號片段]` |\n", "\n", "注意片段語法 `iloc[start:stop:step]`,返回結果為半開放區間的 [start, stop),**不包含**結束的 stop 項。\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9833987590643353" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 使用 loc,選取特定列標籤及欄位標籤位置\n", "df.loc['2019-04-02', 'A']" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A 0.983399\n", "B -0.701624\n", "C 0.894597\n", "D 0.096956\n", "Name: 2019-04-02 00:00:00, dtype: float64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 使用 loc,選取特定列標籤,欄位標籤省略則預設為全選\n", "df.loc['2019-04-02']" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.09695602819994861" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 使用 loc 選取特定列後,再指定欄位標籤\n", "df.loc['2019-04-02'].D" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>C</th>\n", " <th>A</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2019-04-01</th>\n", " <td>0.461541</td>\n", " <td>0.704009</td>\n", " <td>0.461541</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-02</th>\n", " <td>0.983399</td>\n", " <td>0.894597</td>\n", " <td>0.983399</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-03</th>\n", " <td>-0.294962</td>\n", " <td>-0.834394</td>\n", " <td>-0.294962</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-04</th>\n", " <td>-0.962004</td>\n", " <td>-2.070301</td>\n", " <td>-0.962004</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-05</th>\n", " <td>-0.178729</td>\n", " <td>-0.724654</td>\n", " <td>-0.178729</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-06</th>\n", " <td>1.132095</td>\n", " <td>0.536977</td>\n", " <td>1.132095</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A C A\n", "2019-04-01 0.461541 0.704009 0.461541\n", "2019-04-02 0.983399 0.894597 0.983399\n", "2019-04-03 -0.294962 -0.834394 -0.294962\n", "2019-04-04 -0.962004 -2.070301 -0.962004\n", "2019-04-05 -0.178729 -0.724654 -0.178729\n", "2019-04-06 1.132095 0.536977 1.132095" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 所有列(slicing 語法)及部分欄位選取,注意標籤可重複\n", "df.loc[:, ['A', 'C', 'A']]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2019-04-02</th>\n", " <td>0.983399</td>\n", " <td>-0.701624</td>\n", " <td>0.894597</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-03</th>\n", " <td>-0.294962</td>\n", " <td>-1.077161</td>\n", " <td>-0.834394</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-04</th>\n", " <td>-0.962004</td>\n", " <td>0.157794</td>\n", " <td>-2.070301</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C\n", "2019-04-02 0.983399 -0.701624 0.894597\n", "2019-04-03 -0.294962 -1.077161 -0.834394\n", "2019-04-04 -0.962004 0.157794 -2.070301" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 部份列片段,及部分欄位片段選取,注意片段有包含 stop 項\n", "df.loc['20190402':'20190404', 'A':'C']" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.15779415836220198" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 使用 iloc,序號選取特定列及欄位\n", "df.iloc[3, 1]" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A -0.962004\n", "B 0.157794\n", "C -2.070301\n", "D 0.229414\n", "Name: 2019-04-04 00:00:00, dtype: float64" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 使用 iloc,選取特定列序號,欄位序號省略則預設為全選\n", "df.iloc[3]" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-2.0703014049380495" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 使用 iloc 選取特定列後,再指定欄位標籤\n", "df.iloc[3].C" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2019-04-04</th>\n", " <td>-0.962004</td>\n", " <td>0.157794</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-05</th>\n", " <td>-0.178729</td>\n", " <td>-0.639334</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B\n", "2019-04-04 -0.962004 0.157794\n", "2019-04-05 -0.178729 -0.639334" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 類似 numpy 的序號與片段語法,注意片段沒有包含 stop 項\n", "df.iloc[3:5, 0:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### § 使用比較運算產生遮罩\n", "\n", "+ **比較運算子** - 運算結果返回 `bool` 陣列。\n", "\n", "| 比較運算操作 | 說明 |\n", "|------------------------|---------------|\n", "| **X < Y** | 小於 |\n", "| **X <= Y** | 小於或等於 |\n", "| **X > Y** | 大於 |\n", "| **X >= Y** | 大於或等於 |\n", "| **X == Y** | 等於 |\n", "| **X != Y** | 不等於 |\n", "| **(條件1) & (條件2)** | 真值邏輯 AND |\n", "| **(條件1) \\| (條件2)** | 真值邏輯 OR |\n", "| **~(條件1)** | 真值邏輯 NOT |\n" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2019-04-01 True\n", "2019-04-02 True\n", "2019-04-03 False\n", "2019-04-04 False\n", "2019-04-05 False\n", "2019-04-06 True\n", "Freq: D, Name: A, dtype: bool" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 類似 Numpy,比較結果為 Boolean 遮罩\n", "df.A > 0" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " <th>D</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2019-04-01</th>\n", " <td>0.461541</td>\n", " <td>-1.216620</td>\n", " <td>0.704009</td>\n", " <td>0.271539</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-02</th>\n", " <td>0.983399</td>\n", " <td>-0.701624</td>\n", " <td>0.894597</td>\n", " <td>0.096956</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-06</th>\n", " <td>1.132095</td>\n", " <td>-0.987462</td>\n", " <td>0.536977</td>\n", " <td>1.953742</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C D\n", "2019-04-01 0.461541 -1.216620 0.704009 0.271539\n", "2019-04-02 0.983399 -0.701624 0.894597 0.096956\n", "2019-04-06 1.132095 -0.987462 0.536977 1.953742" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Boolean 陣列選取,使用單一欄位的條件\n", "df.loc[df.A > 0]" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2019-04-01 True\n", "2019-04-02 True\n", "2019-04-03 False\n", "2019-04-04 True\n", "2019-04-05 False\n", "2019-04-06 True\n", "Freq: D, dtype: bool" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 兩個條件的邏輯 OR 比較,要用 | 符號\n", "(df.A > 0) | (df.B > 0)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " <th>D</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2019-04-01</th>\n", " <td>0.461541</td>\n", " <td>-1.216620</td>\n", " <td>0.704009</td>\n", " <td>0.271539</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-02</th>\n", " <td>0.983399</td>\n", " <td>-0.701624</td>\n", " <td>0.894597</td>\n", " <td>0.096956</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-04</th>\n", " <td>-0.962004</td>\n", " <td>0.157794</td>\n", " <td>-2.070301</td>\n", " <td>0.229414</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-06</th>\n", " <td>1.132095</td>\n", " <td>-0.987462</td>\n", " <td>0.536977</td>\n", " <td>1.953742</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C D\n", "2019-04-01 0.461541 -1.216620 0.704009 0.271539\n", "2019-04-02 0.983399 -0.701624 0.894597 0.096956\n", "2019-04-04 -0.962004 0.157794 -2.070301 0.229414\n", "2019-04-06 1.132095 -0.987462 0.536977 1.953742" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[(df.A > 0) | (df.B > 0)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<a id=\"append-concat\"></a>\n", "\n", "## 13.3 新增、刪除與合併\n", "\n", "DataFrame 可以直接用括號 [ ] 指定新標籤來新增欄位,也可以用 `loc` 語法指定新的標籤來新增列或欄位數據。 另外也可以使用物件方法及 concat 函式:\n", "+ `DataFrame.append()` - 新增列數據在最後,返回新物件。\n", "+ `DataFrame.assign()` - 新增新欄位,返回新物件。\n", "+ `Series.append()` - 串接另一個 Series,返回新物件。\n", "+ `Pandas.concat()` - 串接 Pandas 的 Series 或 DataFrame 物件。\n", "\n", "刪除使用 drop 物件方法:\n", "+ `DataFrame.drop()` - 刪除指定的行或列標籤,可就地變更。\n", "+ `Series.drop()` - 刪除指定標籤,可就地變更。\n" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " <th>D</th>\n", " <th>E</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2019-04-01</th>\n", " <td>0.461541</td>\n", " <td>-1.216620</td>\n", " <td>0.704009</td>\n", " <td>0.271539</td>\n", " <td>one</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-02</th>\n", " <td>0.983399</td>\n", " <td>-0.701624</td>\n", " <td>0.894597</td>\n", " <td>0.096956</td>\n", " <td>one</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-03</th>\n", " <td>-0.294962</td>\n", " <td>-1.077161</td>\n", " <td>-0.834394</td>\n", " <td>-1.162824</td>\n", " <td>two</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-04</th>\n", " <td>-0.962004</td>\n", " <td>0.157794</td>\n", " <td>-2.070301</td>\n", " <td>0.229414</td>\n", " <td>three</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-05</th>\n", " <td>-0.178729</td>\n", " <td>-0.639334</td>\n", " <td>-0.724654</td>\n", " <td>0.377905</td>\n", " <td>four</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-06</th>\n", " <td>1.132095</td>\n", " <td>-0.987462</td>\n", " <td>0.536977</td>\n", " <td>1.953742</td>\n", " <td>three</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C D E\n", "2019-04-01 0.461541 -1.216620 0.704009 0.271539 one\n", "2019-04-02 0.983399 -0.701624 0.894597 0.096956 one\n", "2019-04-03 -0.294962 -1.077161 -0.834394 -1.162824 two\n", "2019-04-04 -0.962004 0.157794 -2.070301 0.229414 three\n", "2019-04-05 -0.178729 -0.639334 -0.724654 0.377905 four\n", "2019-04-06 1.132095 -0.987462 0.536977 1.953742 three" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 追加一個欄位資料\n", "df['E'] = ['one', 'one', 'two', 'three', 'four', 'three']\n", "df" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " <th>D</th>\n", " <th>E</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2019-04-01</th>\n", " <td>0.461541</td>\n", " <td>-1.216620</td>\n", " <td>0.704009</td>\n", " <td>0.271539</td>\n", " <td>one</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-02</th>\n", " <td>0.983399</td>\n", " <td>-0.701624</td>\n", " <td>0.894597</td>\n", " <td>0.096956</td>\n", " <td>one</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-03</th>\n", " <td>-0.294962</td>\n", " <td>-1.077161</td>\n", " <td>-0.834394</td>\n", " <td>-1.162824</td>\n", " <td>two</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-04</th>\n", " <td>-0.962004</td>\n", " <td>0.157794</td>\n", " <td>-2.070301</td>\n", " <td>0.229414</td>\n", " <td>three</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-05</th>\n", " <td>-0.178729</td>\n", " <td>-0.639334</td>\n", " <td>-0.724654</td>\n", " <td>0.377905</td>\n", " <td>four</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-06</th>\n", " <td>1.132095</td>\n", " <td>-0.987462</td>\n", " <td>0.536977</td>\n", " <td>1.953742</td>\n", " <td>three</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-07</th>\n", " <td>0.100000</td>\n", " <td>0.200000</td>\n", " <td>0.300000</td>\n", " <td>0.400000</td>\n", " <td>four</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C D E\n", "2019-04-01 0.461541 -1.216620 0.704009 0.271539 one\n", "2019-04-02 0.983399 -0.701624 0.894597 0.096956 one\n", "2019-04-03 -0.294962 -1.077161 -0.834394 -1.162824 two\n", "2019-04-04 -0.962004 0.157794 -2.070301 0.229414 three\n", "2019-04-05 -0.178729 -0.639334 -0.724654 0.377905 four\n", "2019-04-06 1.132095 -0.987462 0.536977 1.953742 three\n", "2019-04-07 0.100000 0.200000 0.300000 0.400000 four" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 使用 loc 新增列數據\n", "df.loc[pd.to_datetime('2019-04-07'),:] = pd.Series({'A':0.1, 'B':0.2, 'C':0.3, 'D':0.4, 'E':'four'})\n", "df" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " <th>D</th>\n", " <th>E</th>\n", " <th>F</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2019-04-01</th>\n", " <td>0.461541</td>\n", " <td>-1.216620</td>\n", " <td>0.704009</td>\n", " <td>0.271539</td>\n", " <td>one</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-02</th>\n", " <td>0.983399</td>\n", " <td>-0.701624</td>\n", " <td>0.894597</td>\n", " <td>0.096956</td>\n", " <td>one</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-03</th>\n", " <td>-0.294962</td>\n", " <td>-1.077161</td>\n", " <td>-0.834394</td>\n", " <td>-1.162824</td>\n", " <td>two</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-04</th>\n", " <td>-0.962004</td>\n", " <td>0.157794</td>\n", " <td>-2.070301</td>\n", " <td>0.229414</td>\n", " <td>three</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-05</th>\n", " <td>-0.178729</td>\n", " <td>-0.639334</td>\n", " <td>-0.724654</td>\n", " <td>0.377905</td>\n", " <td>four</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-06</th>\n", " <td>1.132095</td>\n", " <td>-0.987462</td>\n", " <td>0.536977</td>\n", " <td>1.953742</td>\n", " <td>three</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-07</th>\n", " <td>0.100000</td>\n", " <td>0.200000</td>\n", " <td>0.300000</td>\n", " <td>0.400000</td>\n", " <td>four</td>\n", " <td>False</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C D E F\n", "2019-04-01 0.461541 -1.216620 0.704009 0.271539 one True\n", "2019-04-02 0.983399 -0.701624 0.894597 0.096956 one False\n", "2019-04-03 -0.294962 -1.077161 -0.834394 -1.162824 two True\n", "2019-04-04 -0.962004 0.157794 -2.070301 0.229414 three True\n", "2019-04-05 -0.178729 -0.639334 -0.724654 0.377905 four False\n", "2019-04-06 1.132095 -0.987462 0.536977 1.953742 three True\n", "2019-04-07 0.100000 0.200000 0.300000 0.400000 four False" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 使用 loc 給新欄位標籤新增欄位數據\n", "df.loc[:, 'F'] = [True, False, True, True, False, True, False]\n", "df" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2019-04-01 False\n", "2019-04-02 False\n", "2019-04-03 True\n", "2019-04-04 False\n", "2019-04-05 True\n", "2019-04-06 False\n", "2019-04-07 True\n", "Freq: D, Name: E, dtype: bool" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 使用 Series.isin 返回 Boolean 遮罩\n", "df.E.isin(['two', 'four'])" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " <th>D</th>\n", " <th>E</th>\n", " <th>F</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2019-04-03</th>\n", " <td>-0.294962</td>\n", " <td>-1.077161</td>\n", " <td>-0.834394</td>\n", " <td>-1.162824</td>\n", " <td>two</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-05</th>\n", " <td>-0.178729</td>\n", " <td>-0.639334</td>\n", " <td>-0.724654</td>\n", " <td>0.377905</td>\n", " <td>four</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-07</th>\n", " <td>0.100000</td>\n", " <td>0.200000</td>\n", " <td>0.300000</td>\n", " <td>0.400000</td>\n", " <td>four</td>\n", " <td>False</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C D E F\n", "2019-04-03 -0.294962 -1.077161 -0.834394 -1.162824 two True\n", "2019-04-05 -0.178729 -0.639334 -0.724654 0.377905 four False\n", "2019-04-07 0.100000 0.200000 0.300000 0.400000 four False" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 使用 Series.isin 返回的遮罩選取\n", "df.loc[df.E.isin(['two', 'four'])]" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " <th>D</th>\n", " <th>E</th>\n", " <th>F</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2019-04-02</th>\n", " <td>0.983399</td>\n", " <td>-0.701624</td>\n", " <td>0.894597</td>\n", " <td>0.096956</td>\n", " <td>one</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-04</th>\n", " <td>-0.962004</td>\n", " <td>0.157794</td>\n", " <td>-2.070301</td>\n", " <td>0.229414</td>\n", " <td>three</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-03</th>\n", " <td>-0.294962</td>\n", " <td>-1.077161</td>\n", " <td>-0.834394</td>\n", " <td>-1.162824</td>\n", " <td>two</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-05</th>\n", " <td>-0.178729</td>\n", " <td>-0.639334</td>\n", " <td>-0.724654</td>\n", " <td>0.377905</td>\n", " <td>four</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-07</th>\n", " <td>0.100000</td>\n", " <td>0.200000</td>\n", " <td>0.300000</td>\n", " <td>0.400000</td>\n", " <td>four</td>\n", " <td>False</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C D E F\n", "2019-04-02 0.983399 -0.701624 0.894597 0.096956 one False\n", "2019-04-04 -0.962004 0.157794 -2.070301 0.229414 three True\n", "2019-04-03 -0.294962 -1.077161 -0.834394 -1.162824 two True\n", "2019-04-05 -0.178729 -0.639334 -0.724654 0.377905 four False\n", "2019-04-07 0.100000 0.200000 0.300000 0.400000 four False" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 串接,預設 axis 0\n", "pd.concat([df.iloc[[1,3]], df.loc[df.E.isin(['two', 'four'])]])" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " <th>D</th>\n", " <th>E</th>\n", " <th>F</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2019-04-01</th>\n", " <td>0.461541</td>\n", " <td>-1.216620</td>\n", " <td>0.704009</td>\n", " <td>0.271539</td>\n", " <td>one</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-02</th>\n", " <td>0.983399</td>\n", " <td>-0.701624</td>\n", " <td>0.894597</td>\n", " <td>0.096956</td>\n", " <td>one</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-03</th>\n", " <td>-0.294962</td>\n", " <td>-1.077161</td>\n", " <td>-0.834394</td>\n", " <td>-1.162824</td>\n", " <td>two</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-04</th>\n", " <td>-0.962004</td>\n", " <td>0.157794</td>\n", " <td>-2.070301</td>\n", " <td>0.229414</td>\n", " <td>three</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-05</th>\n", " <td>-0.178729</td>\n", " <td>-0.639334</td>\n", " <td>-0.724654</td>\n", " <td>0.377905</td>\n", " <td>four</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-06</th>\n", " <td>1.132095</td>\n", " <td>-0.987462</td>\n", " <td>0.536977</td>\n", " <td>1.953742</td>\n", " <td>three</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-07</th>\n", " <td>0.100000</td>\n", " <td>0.200000</td>\n", " <td>0.300000</td>\n", " <td>0.400000</td>\n", " <td>four</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-07</th>\n", " <td>-0.962004</td>\n", " <td>0.157794</td>\n", " <td>-2.070301</td>\n", " <td>0.229414</td>\n", " <td>three</td>\n", " <td>True</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C D E F\n", "2019-04-01 0.461541 -1.216620 0.704009 0.271539 one True\n", "2019-04-02 0.983399 -0.701624 0.894597 0.096956 one False\n", "2019-04-03 -0.294962 -1.077161 -0.834394 -1.162824 two True\n", "2019-04-04 -0.962004 0.157794 -2.070301 0.229414 three True\n", "2019-04-05 -0.178729 -0.639334 -0.724654 0.377905 four False\n", "2019-04-06 1.132095 -0.987462 0.536977 1.953742 three True\n", "2019-04-07 0.100000 0.200000 0.300000 0.400000 four False\n", "2019-04-07 -0.962004 0.157794 -2.070301 0.229414 three True" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 用 append 追加數據列\n", "s3 = df.iloc[3]\n", "s3.name = pd.to_datetime('2019-04-07')\n", "\n", "df.append(s3)" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " <th>D</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2019-04-01</th>\n", " <td>0.461541</td>\n", " <td>-1.216620</td>\n", " <td>0.704009</td>\n", " <td>0.271539</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-02</th>\n", " <td>0.983399</td>\n", " <td>-0.701624</td>\n", " <td>0.894597</td>\n", " <td>0.096956</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-03</th>\n", " <td>-0.294962</td>\n", " <td>-1.077161</td>\n", " <td>-0.834394</td>\n", " <td>-1.162824</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-04</th>\n", " <td>-0.962004</td>\n", " <td>0.157794</td>\n", " <td>-2.070301</td>\n", " <td>0.229414</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-05</th>\n", " <td>-0.178729</td>\n", " <td>-0.639334</td>\n", " <td>-0.724654</td>\n", " <td>0.377905</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-06</th>\n", " <td>1.132095</td>\n", " <td>-0.987462</td>\n", " <td>0.536977</td>\n", " <td>1.953742</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-07</th>\n", " <td>0.100000</td>\n", " <td>0.200000</td>\n", " <td>0.300000</td>\n", " <td>0.400000</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C D\n", "2019-04-01 0.461541 -1.216620 0.704009 0.271539\n", "2019-04-02 0.983399 -0.701624 0.894597 0.096956\n", "2019-04-03 -0.294962 -1.077161 -0.834394 -1.162824\n", "2019-04-04 -0.962004 0.157794 -2.070301 0.229414\n", "2019-04-05 -0.178729 -0.639334 -0.724654 0.377905\n", "2019-04-06 1.132095 -0.987462 0.536977 1.953742\n", "2019-04-07 0.100000 0.200000 0.300000 0.400000" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 刪除欄位,可就地刪除\n", "df.drop(columns=['E','F'], inplace=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<a id=\"inspecting-data\"></a>\n", "\n", "## 13.4 深入檢視\n", "\n", "時常取得的數據資料只有簡短的文字說明,就算有通常也很難完整交代數據的細節。 除了用 `DataFrame.decribe()` 整體檢視外,個別欄位或列 Series 也有敘述統計指標的方法可以用,另外也還有許多常用的方法可以用來深入檢視:\n", "+ 重複或類別數據 - `nunique()`、`unique()`、`value_counts()`、`duplicated()`、`drop_duplicates()`、`equals()`。\n", "+ 極值關係 - `idxmax()`、`idxmin()`。\n", "+ 轉換 - `apply()`、`map()`、`transform()`。\n" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.2715387197756772" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 檢視某欄位的中位數\n", "df.D.median()" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A 7\n", "B 7\n", "C 7\n", "D 7\n", "E 4\n", "F 2\n", "dtype: int64" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 各欄位有多少不重複的值\n", "df.nunique(axis=0)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['one', 'two', 'three', 'four'], dtype=object)" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 不重複的值是那些(類別)\n", "df.E.unique()\n", "\n", "#df.E.drop_duplicates()" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "one 2\n", "three 2\n", "four 2\n", "two 1\n", "Name: E, dtype: int64" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 各類別出現次數\n", "df.E.value_counts()" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'one'" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 某種數據比例條件下,最大/最小比值是 E 欄位的那一種類別\n", "df.loc[(df.A / df.D).idxmax(), 'E']" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2019-04-01 middle\n", "2019-04-02 middle\n", "2019-04-03 low\n", "2019-04-04 low\n", "2019-04-05 low\n", "2019-04-06 high\n", "2019-04-07 middle\n", "Freq: D, dtype: object" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 根據數值意義取門檻值,轉成三種類別\n", "cut3levels = lambda row: 'high' if row.A > 1 else ('middle' if row.A > 0 else 'low')\n", "df.apply(cut3levels, axis='columns')" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "# Negatives in A 3\n", "# Negatives in B 5\n", "dtype: int64" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 計算符合特定條件的數據出現次數\n", "negcount = lambda x: x < 0\n", "n_negA = df.A.map(negcount).sum()\n", "n_negB = df.B.map(negcount).sum()\n", "pd.Series([n_negA, n_negB], index=['# Negatives in A', '# Negatives in B'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<a id=\"grouping-sorting\"></a>\n", "\n", "## 13.5 分群及排序\n", "\n", "DataFrame 及 Series 都有一個 `groupby()` 方法,會根據指定標籤分群返回一個特別的 [`GroupBy`](http://pandas.pydata.org/pandas-docs/stable/reference/groupby.html) 類別,這個類別會在另外套用方法後產生實際分群計算的結果(類似生成函式的概念)。 `GroupBy` 物件也可以用在迴圈中迭代,針對每群個別處理。 `groupby()` 函式的第一個參數 `by` 是分群的依據,也是分群結果的 key index。`by` 參數可以是:\n", "+ 欄位標籤,或欄位標籤清單\n", "+ DataFrame[欄位標籤],或 DataFrame[欄位標籤清單]\n", "+ Series - 由 DataFrame 選取產生的結果\n" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "E\n", "four 2\n", "one 2\n", "three 2\n", "two 1\n", "dtype: int64" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 根據欄位 E 分群,群中各有多少\n", "df.groupby('E').size()" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "E\n", "four 0.0\n", "one 1.0\n", "three 2.0\n", "two 1.0\n", "Name: F, dtype: float64" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 根據欄位 E 分群,計算每群中的 F 欄位有多少 True\n", "df.groupby('E').F.sum()" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Group False :\n", " A B C D E F\n", "2019-04-02 0.983399 -0.701624 0.894597 0.096956 one False\n", "2019-04-05 -0.178729 -0.639334 -0.724654 0.377905 four False\n", "2019-04-07 0.100000 0.200000 0.300000 0.400000 four False \n", "\n", "Group True :\n", " A B C D E F\n", "2019-04-01 0.461541 -1.216620 0.704009 0.271539 one True\n", "2019-04-03 -0.294962 -1.077161 -0.834394 -1.162824 two True\n", "2019-04-04 -0.962004 0.157794 -2.070301 0.229414 three True\n", "2019-04-06 1.132095 -0.987462 0.536977 1.953742 three True \n", "\n" ] } ], "source": [ "# 根據 F 欄分群,輸出分群結果\n", "for name, group in df.groupby('F'):\n", " print('Group', name, ':')\n", " print(group, '\\n')" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "F\n", "False 0.894597\n", "True 0.704009\n", "Name: C, dtype: float64" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 根據 F 欄分群,各群中最大的 C 欄值是多少?\n", "df.groupby('F').C.max()" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " <th>D</th>\n", " </tr>\n", " <tr>\n", " <th>F</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>False</th>\n", " <td>0.301557</td>\n", " <td>-0.380319</td>\n", " <td>0.156648</td>\n", " <td>0.291620</td>\n", " </tr>\n", " <tr>\n", " <th>True</th>\n", " <td>0.084168</td>\n", " <td>-0.780862</td>\n", " <td>-0.415927</td>\n", " <td>0.322968</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C D\n", "F \n", "False 0.301557 -0.380319 0.156648 0.291620\n", "True 0.084168 -0.780862 -0.415927 0.322968" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 根據 F 欄分群,各群中數值欄位的平均值是多少?\n", "df.groupby('F').mean()" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>B</th>\n", " <th>C</th>\n", " <th>D</th>\n", " </tr>\n", " <tr>\n", " <th>A</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>-1.0</th>\n", " <td>0.157794</td>\n", " <td>-2.070301</td>\n", " <td>0.229414</td>\n", " </tr>\n", " <tr>\n", " <th>0.0</th>\n", " <td>-0.683279</td>\n", " <td>-0.138760</td>\n", " <td>-0.028345</td>\n", " </tr>\n", " <tr>\n", " <th>1.0</th>\n", " <td>-0.844543</td>\n", " <td>0.715787</td>\n", " <td>1.025349</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " B C D\n", "A \n", "-1.0 0.157794 -2.070301 0.229414\n", " 0.0 -0.683279 -0.138760 -0.028345\n", " 1.0 -0.844543 0.715787 1.025349" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 根據 A 欄四捨五入的整數分群並排序,其他各欄位的平均值\n", "df.groupby(df.A.round())['B', 'C', 'D'].mean().sort_index(ascending=True)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>B</th>\n", " <th>C</th>\n", " <th>D</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>A+</th>\n", " <td>-0.676427</td>\n", " <td>0.608896</td>\n", " <td>0.680559</td>\n", " </tr>\n", " <tr>\n", " <th>A-</th>\n", " <td>-0.519567</td>\n", " <td>-1.209783</td>\n", " <td>-0.185168</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " B C D\n", "A+ -0.676427 0.608896 0.680559\n", "A- -0.519567 -1.209783 -0.185168" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 根據 A 欄正負分群,其他各欄位平均值\n", "cut2levels = lambda row: 'A+' if row.A > 0 else 'A-'\n", "df.groupby(df.apply(cut2levels, axis='columns'))['B', 'C', 'D'].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### § 聚合處理 Aggregation\n", "\n", "Pandas 的 `aggregate()`(別名: `agg()`)用來套用函式作整體的向量式運算,與 Numpy 函式不同的是,Pandas 的 `aggregate` 永遠只會針對欄位方向或列方向套用。 `aggregate()` 函式接受的參數為:\n", "+ 函式,或函式清單\n", "+ 函式名稱字串,或函式名稱字串的清單\n", "+ 以標籤當 Key 的字典,對應的值可以是上述的可接受參數\n", "\n", "內建可以直接使用字串名稱的函式: `sum`、`mean`、`min`、`max`、`size`、`count`、`std`、`var`、`sem`(standard error of the mean)、`describe`。" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>min</th>\n", " <th>max</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>A+</th>\n", " <td>-1.216620</td>\n", " <td>0.200000</td>\n", " </tr>\n", " <tr>\n", " <th>A-</th>\n", " <td>-1.077161</td>\n", " <td>0.157794</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " min max\n", "A+ -1.216620 0.200000\n", "A- -1.077161 0.157794" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 根據 A 欄正負分群,B 欄的最大及最小值\n", "df.groupby(df.apply(cut2levels, axis='columns')).B.agg(['min','max'])" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead tr th {\n", " text-align: left;\n", " }\n", "\n", " .dataframe thead tr:last-of-type th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr>\n", " <th></th>\n", " <th colspan=\"2\" halign=\"left\">A</th>\n", " <th colspan=\"2\" halign=\"left\">B</th>\n", " <th colspan=\"2\" halign=\"left\">C</th>\n", " <th colspan=\"2\" halign=\"left\">D</th>\n", " </tr>\n", " <tr>\n", " <th></th>\n", " <th>min</th>\n", " <th>max</th>\n", " <th>min</th>\n", " <th>max</th>\n", " <th>min</th>\n", " <th>max</th>\n", " <th>min</th>\n", " <th>max</th>\n", " </tr>\n", " <tr>\n", " <th>F</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>False</th>\n", " <td>-0.178729</td>\n", " <td>0.983399</td>\n", " <td>-0.701624</td>\n", " <td>0.200000</td>\n", " <td>-0.724654</td>\n", " <td>0.894597</td>\n", " <td>0.096956</td>\n", " <td>0.400000</td>\n", " </tr>\n", " <tr>\n", " <th>True</th>\n", " <td>-0.962004</td>\n", " <td>1.132095</td>\n", " <td>-1.216620</td>\n", " <td>0.157794</td>\n", " <td>-2.070301</td>\n", " <td>0.704009</td>\n", " <td>-1.162824</td>\n", " <td>1.953742</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C D \\\n", " min max min max min max min \n", "F \n", "False -0.178729 0.983399 -0.701624 0.200000 -0.724654 0.894597 0.096956 \n", "True -0.962004 1.132095 -1.216620 0.157794 -2.070301 0.704009 -1.162824 \n", "\n", " \n", " max \n", "F \n", "False 0.400000 \n", "True 1.953742 " ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 根據 F 欄分群,各欄位的最大及最小值\n", "df.groupby('F').agg({'A':['min','max'], 'B':['min','max'], 'C':['min','max'], 'D':['min','max']})" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>min</th>\n", " <th>max</th>\n", " </tr>\n", " <tr>\n", " <th>E</th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>one</th>\n", " <td>0.704009</td>\n", " <td>0.894597</td>\n", " </tr>\n", " <tr>\n", " <th>four</th>\n", " <td>-0.724654</td>\n", " <td>0.300000</td>\n", " </tr>\n", " <tr>\n", " <th>two</th>\n", " <td>-0.834394</td>\n", " <td>-0.834394</td>\n", " </tr>\n", " <tr>\n", " <th>three</th>\n", " <td>-2.070301</td>\n", " <td>0.536977</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " min max\n", "E \n", "one 0.704009 0.894597\n", "four -0.724654 0.300000\n", "two -0.834394 -0.834394\n", "three -2.070301 0.536977" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 根據 E 欄分群,C 欄位的最大及最小值,並以最小值排序\n", "df.groupby('E').C.agg(['min','max']).sort_values(by=['min'], ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<a id=\"missing-data\"></a>\n", "\n", "## 13.6 漏失數據處理\n", "\n", "漏失數據(missing data,又稱 **NA**)在 Pandas 中主要使用 `numpy.nan`(Not a Number)形態表示,但也不排除使用 Python 的 `None` 來指定。 要注意的是,兩個 `numpy.nan` 互相比較永遠不會相等,但是 `None` 會相等,所以偵測漏失數據要使用 Pandas 提供的 `isna()` 函式。" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(NaN == NaN) is False\n", "(None == None) is True\n" ] } ], "source": [ "# NaN 永遠不會等於 NaN\n", "print('(NaN == NaN) is', np.nan == np.nan)\n", "# None 等於 None\n", "print('(None == None) is', None == None)" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " <th>D</th>\n", " <th>F</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2019-04-01</th>\n", " <td>0.461541</td>\n", " <td>NaN</td>\n", " <td>0.704009</td>\n", " <td>0.271539</td>\n", " <td>1.0</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-02</th>\n", " <td>0.983399</td>\n", " <td>NaN</td>\n", " <td>0.894597</td>\n", " <td>0.096956</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-03</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>1.0</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-04</th>\n", " <td>NaN</td>\n", " <td>0.157794</td>\n", " <td>NaN</td>\n", " <td>0.229414</td>\n", " <td>1.0</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-05</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>0.377905</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-06</th>\n", " <td>1.132095</td>\n", " <td>NaN</td>\n", " <td>0.536977</td>\n", " <td>1.953742</td>\n", " <td>1.0</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-07</th>\n", " <td>0.100000</td>\n", " <td>0.200000</td>\n", " <td>0.300000</td>\n", " <td>0.400000</td>\n", " <td>NaN</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C D F\n", "2019-04-01 0.461541 NaN 0.704009 0.271539 1.0\n", "2019-04-02 0.983399 NaN 0.894597 0.096956 NaN\n", "2019-04-03 NaN NaN NaN NaN 1.0\n", "2019-04-04 NaN 0.157794 NaN 0.229414 1.0\n", "2019-04-05 NaN NaN NaN 0.377905 NaN\n", "2019-04-06 1.132095 NaN 0.536977 1.953742 1.0\n", "2019-04-07 0.100000 0.200000 0.300000 0.400000 NaN" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 刪除欄位,可就地刪除\n", "df.drop(columns=['E'], inplace=True)\n", "# 刻意製造含 NaN 的數據,不常用這樣的選取方式\n", "dfmiss = df[df > 0]\n", "dfmiss" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " <th>D</th>\n", " <th>F</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2019-04-01</th>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-02</th>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-03</th>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-04</th>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-05</th>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-06</th>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-07</th>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>True</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C D F\n", "2019-04-01 False True False False False\n", "2019-04-02 False True False False True\n", "2019-04-03 True True True True False\n", "2019-04-04 True False True False False\n", "2019-04-05 True True True False True\n", "2019-04-06 False True False False False\n", "2019-04-07 False False False False True" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 偵測 NA 返回 Boolean 遮罩,注意:不能使用 dfmiss == np.nan 這樣的比較\n", "dfmiss.isna()" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A 3\n", "B 5\n", "C 3\n", "D 1\n", "F 3\n", "dtype: int64" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 計算各欄位分別有多少漏失數據\n", "dfmiss.isna().sum()" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " A B C D F T\n", "2019-04-01 0.461541 NaN 0.704009 0.271539 1.0 2019-04-17\n", "2019-04-02 0.983399 NaN 0.894597 0.096956 NaN 2019-04-17\n", "2019-04-03 NaN NaN NaN NaN 1.0 2019-04-17\n", "2019-04-04 NaN 0.157794 NaN 0.229414 1.0 2019-04-17\n", "2019-04-05 NaN NaN NaN 0.377905 NaN 2019-04-17\n", "2019-04-06 1.132095 NaN 0.536977 1.953742 1.0 2019-04-17\n", "2019-04-07 0.100000 0.200000 0.300000 0.400000 NaN 2019-04-17\n" ] } ], "source": [ "# 針對 datetime 類型資料,Pandas 內部另外提供了 `NaT` 的資料類型來代表漏失的時間數據。\n", "dfmiss['T'] = pd.Timestamp('20190417')\n", "print(dfmiss)" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " A B C D F T\n", "2019-04-01 0.461541 NaN 0.704009 0.271539 1.0 2019-04-17\n", "2019-04-02 0.983399 NaN 0.894597 0.096956 NaN NaT\n", "2019-04-03 NaN NaN NaN NaN 1.0 2019-04-17\n", "2019-04-04 NaN 0.157794 NaN 0.229414 1.0 NaT\n", "2019-04-05 NaN NaN NaN 0.377905 NaN NaT\n", "2019-04-06 1.132095 NaN 0.536977 1.953742 1.0 NaT\n", "2019-04-07 0.100000 0.200000 0.300000 0.400000 NaN 2019-04-17\n" ] } ], "source": [ "dfmiss.iloc[[1, 3, 4, 5], 5] = None\n", "print(dfmiss)" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "column A sum = 2.677034795611951\n", "column A mean = 0.6692586989029877\n", "column A cumsum =\n", " 2019-04-01 0.461541\n", "2019-04-02 1.444940\n", "2019-04-03 NaN\n", "2019-04-04 NaN\n", "2019-04-05 NaN\n", "2019-04-06 2.577035\n", "2019-04-07 2.677035\n", "Freq: D, Name: A, dtype: float64\n" ] } ], "source": [ "# 敘述統計函式,sum 把 NA 當 0,mean, cunsum 掠過\n", "print('column A sum =', dfmiss['A'].sum())\n", "print('column A mean =', dfmiss['A'].mean())\n", "print('column A cumsum =\\n', dfmiss['A'].cumsum())" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " <th>D</th>\n", " <th>F</th>\n", " <th>T</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ "Empty DataFrame\n", "Columns: [A, B, C, D, F, T]\n", "Index: []" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 把有漏失任何欄位值的記錄丟掉,也可指定全部欄位沒有值才丟\n", "dfmiss.dropna()" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "【填補前】:\n", " A B C D F T\n", "2019-04-01 0.461541 NaN 0.704009 0.271539 1.0 2019-04-17\n", "2019-04-02 0.983399 NaN 0.894597 0.096956 NaN NaT\n", "2019-04-03 NaN NaN NaN NaN 1.0 2019-04-17\n", "2019-04-04 NaN 0.157794 NaN 0.229414 1.0 NaT\n", "2019-04-05 NaN NaN NaN 0.377905 NaN NaT\n", "2019-04-06 1.132095 NaN 0.536977 1.953742 1.0 NaT\n", "2019-04-07 0.100000 0.200000 0.300000 0.400000 NaN 2019-04-17\n", "\n", "【填補 0】:\n", " A B C D F T\n", "2019-04-01 0.461541 0.000000 0.704009 0.271539 1.0 2019-04-17 00:00:00\n", "2019-04-02 0.983399 0.000000 0.894597 0.096956 0.0 0\n", "2019-04-03 0.000000 0.000000 0.000000 0.000000 1.0 2019-04-17 00:00:00\n", "2019-04-04 0.000000 0.157794 0.000000 0.229414 1.0 0\n", "2019-04-05 0.000000 0.000000 0.000000 0.377905 0.0 0\n", "2019-04-06 1.132095 0.000000 0.536977 1.953742 1.0 0\n", "2019-04-07 0.100000 0.200000 0.300000 0.400000 0.0 2019-04-17 00:00:00\n", "\n", "【後向填補】:\n", " A B C D F T\n", "2019-04-01 0.461541 0.157794 0.704009 0.271539 1.0 2019-04-17\n", "2019-04-02 0.983399 0.157794 0.894597 0.096956 1.0 2019-04-17\n", "2019-04-03 1.132095 0.157794 0.536977 0.229414 1.0 2019-04-17\n", "2019-04-04 1.132095 0.157794 0.536977 0.229414 1.0 2019-04-17\n", "2019-04-05 1.132095 0.200000 0.536977 0.377905 1.0 2019-04-17\n", "2019-04-06 1.132095 0.200000 0.536977 1.953742 1.0 2019-04-17\n", "2019-04-07 0.100000 0.200000 0.300000 0.400000 NaN 2019-04-17\n", "\n", "【前向填補】:\n", " A B C D F T\n", "2019-04-01 0.461541 NaN 0.704009 0.271539 1.0 2019-04-17\n", "2019-04-02 0.983399 NaN 0.894597 0.096956 1.0 2019-04-17\n", "2019-04-03 0.983399 NaN 0.894597 0.096956 1.0 2019-04-17\n", "2019-04-04 0.983399 0.157794 0.894597 0.229414 1.0 2019-04-17\n", "2019-04-05 0.983399 0.157794 0.894597 0.377905 1.0 2019-04-17\n", "2019-04-06 1.132095 0.157794 0.536977 1.953742 1.0 2019-04-17\n", "2019-04-07 0.100000 0.200000 0.300000 0.400000 1.0 2019-04-17\n", "\n" ] } ], "source": [ "# 丟掉很可惜,填補值來用\n", "print('【填補前】:\\n{}\\n'.format(dfmiss))\n", "\n", "print('【填補 0】:\\n{}\\n'.format(dfmiss.fillna(0)))\n", "\n", "print('【後向填補】:\\n{}\\n'.format(dfmiss.fillna(method='bfill')))\n", "\n", "print('【前向填補】:\\n{}\\n'.format(dfmiss.fillna(method='ffill')))" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "【填補前】:\n", " A B C D F T\n", "2019-04-01 0.461541 NaN 0.704009 0.271539 1.0 2019-04-17\n", "2019-04-02 0.983399 NaN 0.894597 0.096956 NaN NaT\n", "2019-04-03 NaN NaN NaN NaN 1.0 2019-04-17\n", "2019-04-04 NaN 0.157794 NaN 0.229414 1.0 NaT\n", "2019-04-05 NaN NaN NaN 0.377905 NaN NaT\n", "2019-04-06 1.132095 NaN 0.536977 1.953742 1.0 NaT\n", "2019-04-07 0.100000 0.200000 0.300000 0.400000 NaN 2019-04-17\n", "\n", "【linear 內插填補】:\n", " A B C D F T\n", "2019-04-01 0.461541 NaN 0.704009 0.271539 1.0 2019-04-17\n", "2019-04-02 0.983399 NaN 0.894597 0.096956 1.0 NaT\n", "2019-04-03 1.020573 NaN 0.805192 0.163185 1.0 2019-04-17\n", "2019-04-04 1.057747 0.157794 0.715787 0.229414 1.0 NaT\n", "2019-04-05 1.094921 0.171863 0.626382 0.377905 1.0 NaT\n", "2019-04-06 1.132095 0.185931 0.536977 1.953742 1.0 NaT\n", "2019-04-07 0.100000 0.200000 0.300000 0.400000 1.0 2019-04-17\n", "\n", "【pchip 內插填補】:\n", " A B C D F T\n", "2019-04-01 0.461541 NaN 0.704009 0.271539 1.0 2019-04-17\n", "2019-04-02 0.983399 NaN 0.894597 0.096956 1.0 NaT\n", "2019-04-03 1.053862 NaN 0.865482 0.139266 1.0 2019-04-17\n", "2019-04-04 1.099728 0.157794 0.787156 0.229414 1.0 NaT\n", "2019-04-05 1.124604 0.171863 0.673145 0.377905 1.0 NaT\n", "2019-04-06 1.132095 0.185931 0.536977 1.953742 1.0 NaT\n", "2019-04-07 0.100000 0.200000 0.300000 0.400000 1.0 2019-04-17\n", "\n", "【krogh 內插填補】:\n", " A B C D F T\n", "2019-04-01 0.461541 NaN 0.704009 0.271539 1.0 2019-04-17\n", "2019-04-02 0.983399 NaN 0.894597 0.096956 1.0 NaT\n", "2019-04-03 1.428300 NaN 0.946704 0.703491 1.0 2019-04-17\n", "2019-04-04 1.679327 0.157794 0.886813 0.229414 1.0 NaT\n", "2019-04-05 1.619565 0.171863 0.741410 0.377905 1.0 NaT\n", "2019-04-06 1.132095 0.185931 0.536977 1.953742 1.0 NaT\n", "2019-04-07 0.100000 0.200000 0.300000 0.400000 1.0 2019-04-17\n", "\n" ] } ], "source": [ "# 使用內插值填補,部份方法來自 scipy.interpolate 模組\n", "print('【填補前】:\\n{}\\n'.format(dfmiss))\n", "\n", "print('【linear 內插填補】:\\n{}\\n'.format(dfmiss.interpolate(method='linear')))\n", "\n", "print('【pchip 內插填補】:\\n{}\\n'.format(dfmiss.interpolate(method='pchip')))\n", "\n", "print('【krogh 內插填補】:\\n{}\\n'.format(dfmiss.interpolate(method='krogh')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<a id=\"time-series\"></a>\n", "\n", "## 13.7 時間序列\n", "\n", "Pandas 在 NumPy 的 `datetime64` 和 `timedelta64` 資料形態的基礎上,建構了相當多針對時間序列作處理的功能。 主要以四種資料類型來處理不同的時間概念:\n", "+ [`Timestamp`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html) - (**Date times**時間概念) 支援時區資訊的特定日期時間,類似 Python 標準函式庫裡的 `datetime.datetime`。\n", "+ [`Timedelta`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timedelta.html) - (**Time deltas**時間概念) 絕對的連續性時間間距。類似 Python 標準函式庫裡的 `datetime.timedelta`。\n", "+ [`Period`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Period.html) - (**Time spans**時間概念) 週期性時間間隔的時間點。\n", "+ [`DateOffset`](http://pandas.pydata.org/pandas-docs/stable/reference/offset_frequency.html) - (**Date offsets**時間概念) 用於不同曆法的相對時間間距。\n", "\n", "相對於時間概念的操作,Pandas 提供的資料形態:\n", "\n", "| 時間概念 | 純量類別 | 陣列類別 | Pandas 資料類別 | 主要建立方法 |\n", "|--------------------|--------------|------------------|-------------------|-----------------------------------------|\n", "| **Date times** | `Timestamp` | `DatetimeIndex` | `datetime64[ns]` | `to_datetime()` 或 `date_range()` |\n", "| **Time deltas** | `Timedelta` | `TimedeltaIndex` | `timedelta64[ns]` | `to_timedelta()` 或 `timedelta_range()` |\n", "| **Time spans** | `Period` | `PeriodIndex` | `period[freq]` | `Period()` 或 `period_range()` |\n", "| **Date offsets** | `DateOffset` | None | None | `DateOffset()` |\n" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DatetimeIndex(['2019-04-01', '2019-04-01', '2018-04-01'], dtype='datetime64[ns]', freq=None)" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Date time 資料型態的建立可以接受各種不同的格式\n", "import datetime\n", "pd.to_datetime(['4/1/2019', np.datetime64('2019-04-01'), datetime.datetime(2018, 4, 1)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", " date_range(start, end, periods, freq, ...)\n", "\n", " 參數:\n", " start - 開始時間\n", " end - 結束時間\n", " periods - 共產生幾個時間點\n", " freq - 指定時距週期頻率的格式字串\n", "```\n", "\n", "[`date_range`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html) 的 `freq` 參數可接受指定格式字串:\n", "+ `D` - calendar day frequency\n", "+ `B` - business day frequency\n", "+ `W` - weekly frequency\n", "+ `M` - month end frequency\n", "+ `Q` - quarter end frequency\n", "+ `Y` - year end frequency\n", "+ `H` - hourly frequency\n", "+ `min` - minutely frequency\n", "+ `S` - secondly frequency\n", "+ `ms` - milliseconds\n", "+ `us` - microseconds\n", "+ `N` - nanoseconds\n" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DatetimeIndex(['2019-04-01 00:00:00', '2019-04-01 01:00:00',\n", " '2019-04-01 02:00:00', '2019-04-01 03:00:00',\n", " '2019-04-01 04:00:00', '2019-04-01 05:00:00'],\n", " dtype='datetime64[ns]', freq='H')" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 產生固定時間週期的序列\n", "pd.date_range('2019-04-01', periods=6, freq='H')" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2019-05-03 is Friday\n" ] } ], "source": [ "# Timestamp 物件的建立\n", "aFriday = pd.Timestamp('2019-05-03')\n", "print(aFriday.date(), 'is', aFriday.day_name())" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2019-05-03 加一天 = 2019-05-04 is Saturday\n" ] } ], "source": [ "# 增加一天(絕對的連續性時間間距)\n", "aSaturday = aFriday + pd.Timedelta('1 day')\n", "print(aFriday.date(),'加一天 =', aSaturday.date(), 'is', aSaturday.day_name())" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2019-05-03 加一個工作天 = 2019-05-06 is Monday\n" ] } ], "source": [ "# 增加一個工作天(Business Day,相對的時間間距概念)\n", "aMonday = aFriday + pd.offsets.BDay()\n", "print(aFriday.date(),'加一個工作天 =', aMonday.date(), 'is', aMonday.day_name())" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2019-05-01 -0.101792\n", "2019-05-02 -0.571034\n", "2019-05-03 1.078528\n", "dtype: float64" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 時間序列的 DataFrame 及 Series 資料,可以使用 Timestamp 當作 index\n", "pd.Series(np.random.randn(3), index=[pd.Timestamp('2019-05-01'), pd.Timestamp('2019-05-02'), pd.Timestamp('2019-05-03')])" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2019-04 -0.609403\n", "2019-05 1.268617\n", "2019-06 0.913145\n", "Freq: M, dtype: float64" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 時間序列的 DataFrame 及 Series 資料,也可以使用 Period 當作 index\n", "pd.Series(np.random.randn(3), index=[pd.Period('2019-04'), pd.Period('2019-05'), pd.Period('2019-06')])" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "時間欄位被切割的 DataFrame:\n", " year month day hour\n", "0 2015 2 4 2\n", "1 2016 3 5 3\n" ] }, { "data": { "text/plain": [ "0 2015-02-04 02:00:00\n", "1 2016-03-05 03:00:00\n", "dtype: datetime64[ns]" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 如果日期時間被分割成不同的欄位,也可以用 to_datetime() 作合併轉換,但欄位標籤要是可以辨識的名字,如: year, month, day, ...\n", "dfYMDH = pd.DataFrame({'year': [2015, 2016], 'month': [2, 3], 'day': [4, 5], 'hour': [2, 3]})\n", "print('時間欄位被切割的 DataFrame:\\n', dfYMDH)\n", "\n", "pd.to_datetime(dfYMDH)" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2019-04-02 0.983399\n", "2019-04-03 -0.294962\n", "2019-04-04 -0.962004\n", "2019-04-05 -0.178729\n", "2019-04-06 1.132095\n", "Freq: D, Name: A, dtype: float64" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Series 的 index 是時間,可以用中括號及時間字串選取資料\n", "df.A['2019-04-02':'2019-04-06']" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2019-04-01 -1.216620\n", "2019-04-02 -0.701624\n", "2019-04-03 -1.077161\n", "2019-04-04 0.157794\n", "2019-04-05 -0.639334\n", "2019-04-06 -0.987462\n", "2019-04-07 0.200000\n", "Freq: D, Name: B, dtype: float64" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 可以用部分字串選取大範圍時間\n", "df.B['2019-04']" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " <th>D</th>\n", " <th>F</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2019-04-02</th>\n", " <td>0.461541</td>\n", " <td>-1.216620</td>\n", " <td>0.704009</td>\n", " <td>0.271539</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-03</th>\n", " <td>0.983399</td>\n", " <td>-0.701624</td>\n", " <td>0.894597</td>\n", " <td>0.096956</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-04</th>\n", " <td>-0.294962</td>\n", " <td>-1.077161</td>\n", " <td>-0.834394</td>\n", " <td>-1.162824</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-05</th>\n", " <td>-0.962004</td>\n", " <td>0.157794</td>\n", " <td>-2.070301</td>\n", " <td>0.229414</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-06</th>\n", " <td>-0.178729</td>\n", " <td>-0.639334</td>\n", " <td>-0.724654</td>\n", " <td>0.377905</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-07</th>\n", " <td>1.132095</td>\n", " <td>-0.987462</td>\n", " <td>0.536977</td>\n", " <td>1.953742</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-08</th>\n", " <td>0.100000</td>\n", " <td>0.200000</td>\n", " <td>0.300000</td>\n", " <td>0.400000</td>\n", " <td>False</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C D F\n", "2019-04-02 0.461541 -1.216620 0.704009 0.271539 True\n", "2019-04-03 0.983399 -0.701624 0.894597 0.096956 False\n", "2019-04-04 -0.294962 -1.077161 -0.834394 -1.162824 True\n", "2019-04-05 -0.962004 0.157794 -2.070301 0.229414 True\n", "2019-04-06 -0.178729 -0.639334 -0.724654 0.377905 False\n", "2019-04-07 1.132095 -0.987462 0.536977 1.953742 True\n", "2019-04-08 0.100000 0.200000 0.300000 0.400000 False" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 使用 shift() 將所有資料列向前(正的 Period)或向後(負的 Period)移動,\n", "# 注意: 試試 df.shift(1),沒有指定 freq 參數的話,資料與時間不會對齊\n", "df.shift(1, freq='D')" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " <th>D</th>\n", " <th>F</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2019-04-02</th>\n", " <td>0.461541</td>\n", " <td>-1.216620</td>\n", " <td>0.704009</td>\n", " <td>0.271539</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-03</th>\n", " <td>0.983399</td>\n", " <td>-0.701624</td>\n", " <td>0.894597</td>\n", " <td>0.096956</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-04</th>\n", " <td>-0.294962</td>\n", " <td>-1.077161</td>\n", " <td>-0.834394</td>\n", " <td>-1.162824</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-05</th>\n", " <td>-0.962004</td>\n", " <td>0.157794</td>\n", " <td>-2.070301</td>\n", " <td>0.229414</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-06</th>\n", " <td>-0.178729</td>\n", " <td>-0.639334</td>\n", " <td>-0.724654</td>\n", " <td>0.377905</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-07</th>\n", " <td>1.132095</td>\n", " <td>-0.987462</td>\n", " <td>0.536977</td>\n", " <td>1.953742</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-08</th>\n", " <td>0.100000</td>\n", " <td>0.200000</td>\n", " <td>0.300000</td>\n", " <td>0.400000</td>\n", " <td>False</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C D F\n", "2019-04-02 0.461541 -1.216620 0.704009 0.271539 True\n", "2019-04-03 0.983399 -0.701624 0.894597 0.096956 False\n", "2019-04-04 -0.294962 -1.077161 -0.834394 -1.162824 True\n", "2019-04-05 -0.962004 0.157794 -2.070301 0.229414 True\n", "2019-04-06 -0.178729 -0.639334 -0.724654 0.377905 False\n", "2019-04-07 1.132095 -0.987462 0.536977 1.953742 True\n", "2019-04-08 0.100000 0.200000 0.300000 0.400000 False" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 使用 tshift() 將所有資料列向前(正的 Period)或向後(負的 Period)移動,\n", "# 結果與 shift(1, freq='D') 一樣\n", "df.tshift(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### § 時間序列重新取樣\n", "\n", "針對 index 是時間序列的 Series 和 DataFrame,Pandas 設計了有如**“時間 groupby”**的 [**resampler**](http://pandas.pydata.org/pandas-docs/stable/reference/resampling.html) 機制,容許簡單有效率地對時間序列作頻率的轉換。 Series 和 DataFrame 分別都提供了方法用來建立 **resampler** 物件:\n", "+ [`Series.resample()`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.resample.html)\n", "+ [`DataFrame.resample()`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html)\n" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "defaultdict(list,\n", " {Timestamp('2019-04-01 00:00:00', freq='2D'): [0, 1],\n", " Timestamp('2019-04-03 00:00:00', freq='2D'): [2, 3],\n", " Timestamp('2019-04-05 00:00:00', freq='2D'): [4, 5],\n", " Timestamp('2019-04-07 00:00:00', freq='2D'): [6]})" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Downsampling 成兩天一群\n", "df.resample('2D').indices" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " <th>D</th>\n", " <th>F</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2019-04-01 00:00:00</th>\n", " <td>0.461541</td>\n", " <td>-1.216620</td>\n", " <td>0.704009</td>\n", " <td>0.271539</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-01 06:00:00</th>\n", " <td>0.592006</td>\n", " <td>-1.087871</td>\n", " <td>0.751656</td>\n", " <td>0.227893</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-01 12:00:00</th>\n", " <td>0.722470</td>\n", " <td>-0.959122</td>\n", " <td>0.799303</td>\n", " <td>0.184247</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-01 18:00:00</th>\n", " <td>0.852934</td>\n", " <td>-0.830373</td>\n", " <td>0.846950</td>\n", " <td>0.140602</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-02 00:00:00</th>\n", " <td>0.983399</td>\n", " <td>-0.701624</td>\n", " <td>0.894597</td>\n", " <td>0.096956</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-02 06:00:00</th>\n", " <td>0.663809</td>\n", " <td>-0.795508</td>\n", " <td>0.462349</td>\n", " <td>-0.217989</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-02 12:00:00</th>\n", " <td>0.344218</td>\n", " <td>-0.889392</td>\n", " <td>0.030101</td>\n", " <td>-0.532934</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-02 18:00:00</th>\n", " <td>0.024628</td>\n", " <td>-0.983277</td>\n", " <td>-0.402146</td>\n", " <td>-0.847879</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-03 00:00:00</th>\n", " <td>-0.294962</td>\n", " <td>-1.077161</td>\n", " <td>-0.834394</td>\n", " <td>-1.162824</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-03 06:00:00</th>\n", " <td>-0.461723</td>\n", " <td>-0.768422</td>\n", " <td>-1.143371</td>\n", " <td>-0.814765</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-03 12:00:00</th>\n", " <td>-0.628483</td>\n", " <td>-0.459683</td>\n", " <td>-1.452348</td>\n", " <td>-0.466705</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-03 18:00:00</th>\n", " <td>-0.795243</td>\n", " <td>-0.150945</td>\n", " <td>-1.761325</td>\n", " <td>-0.118645</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-04 00:00:00</th>\n", " <td>-0.962004</td>\n", " <td>0.157794</td>\n", " <td>-2.070301</td>\n", " <td>0.229414</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-04 06:00:00</th>\n", " <td>-0.766185</td>\n", " <td>-0.041488</td>\n", " <td>-1.733890</td>\n", " <td>0.266537</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-04 12:00:00</th>\n", " <td>-0.570367</td>\n", " <td>-0.240770</td>\n", " <td>-1.397478</td>\n", " <td>0.303660</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-04 18:00:00</th>\n", " <td>-0.374548</td>\n", " <td>-0.440052</td>\n", " <td>-1.061066</td>\n", " <td>0.340782</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-05 00:00:00</th>\n", " <td>-0.178729</td>\n", " <td>-0.639334</td>\n", " <td>-0.724654</td>\n", " <td>0.377905</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-05 06:00:00</th>\n", " <td>0.148977</td>\n", " <td>-0.726366</td>\n", " <td>-0.409246</td>\n", " <td>0.771864</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-05 12:00:00</th>\n", " <td>0.476683</td>\n", " <td>-0.813398</td>\n", " <td>-0.093838</td>\n", " <td>1.165824</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-05 18:00:00</th>\n", " <td>0.804389</td>\n", " <td>-0.900430</td>\n", " <td>0.221570</td>\n", " <td>1.559783</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-06 00:00:00</th>\n", " <td>1.132095</td>\n", " <td>-0.987462</td>\n", " <td>0.536977</td>\n", " <td>1.953742</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-06 06:00:00</th>\n", " <td>0.874071</td>\n", " <td>-0.690597</td>\n", " <td>0.477733</td>\n", " <td>1.565307</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-06 12:00:00</th>\n", " <td>0.616047</td>\n", " <td>-0.393731</td>\n", " <td>0.418489</td>\n", " <td>1.176871</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-06 18:00:00</th>\n", " <td>0.358024</td>\n", " <td>-0.096866</td>\n", " <td>0.359244</td>\n", " <td>0.788436</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2019-04-07 00:00:00</th>\n", " <td>0.100000</td>\n", " <td>0.200000</td>\n", " <td>0.300000</td>\n", " <td>0.400000</td>\n", " <td>False</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C D F\n", "2019-04-01 00:00:00 0.461541 -1.216620 0.704009 0.271539 True\n", "2019-04-01 06:00:00 0.592006 -1.087871 0.751656 0.227893 True\n", "2019-04-01 12:00:00 0.722470 -0.959122 0.799303 0.184247 True\n", "2019-04-01 18:00:00 0.852934 -0.830373 0.846950 0.140602 True\n", "2019-04-02 00:00:00 0.983399 -0.701624 0.894597 0.096956 False\n", "2019-04-02 06:00:00 0.663809 -0.795508 0.462349 -0.217989 False\n", "2019-04-02 12:00:00 0.344218 -0.889392 0.030101 -0.532934 False\n", "2019-04-02 18:00:00 0.024628 -0.983277 -0.402146 -0.847879 False\n", "2019-04-03 00:00:00 -0.294962 -1.077161 -0.834394 -1.162824 True\n", "2019-04-03 06:00:00 -0.461723 -0.768422 -1.143371 -0.814765 True\n", "2019-04-03 12:00:00 -0.628483 -0.459683 -1.452348 -0.466705 True\n", "2019-04-03 18:00:00 -0.795243 -0.150945 -1.761325 -0.118645 True\n", "2019-04-04 00:00:00 -0.962004 0.157794 -2.070301 0.229414 True\n", "2019-04-04 06:00:00 -0.766185 -0.041488 -1.733890 0.266537 True\n", "2019-04-04 12:00:00 -0.570367 -0.240770 -1.397478 0.303660 True\n", "2019-04-04 18:00:00 -0.374548 -0.440052 -1.061066 0.340782 True\n", "2019-04-05 00:00:00 -0.178729 -0.639334 -0.724654 0.377905 False\n", "2019-04-05 06:00:00 0.148977 -0.726366 -0.409246 0.771864 False\n", "2019-04-05 12:00:00 0.476683 -0.813398 -0.093838 1.165824 False\n", "2019-04-05 18:00:00 0.804389 -0.900430 0.221570 1.559783 False\n", "2019-04-06 00:00:00 1.132095 -0.987462 0.536977 1.953742 True\n", "2019-04-06 06:00:00 0.874071 -0.690597 0.477733 1.565307 True\n", "2019-04-06 12:00:00 0.616047 -0.393731 0.418489 1.176871 True\n", "2019-04-06 18:00:00 0.358024 -0.096866 0.359244 0.788436 True\n", "2019-04-07 00:00:00 0.100000 0.200000 0.300000 0.400000 False" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Upsampling 成每6小時一群,原本沒有的資料如同漏失資料一樣要再進行填補\n", "df.resample('6H').interpolate().fillna(method='ffill')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }