{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# pandas 기초 (2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 싸이그래머 : 생물심리Py\n", "* 발표자 : 김무성 " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 차례" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 추천 사이트\n", "* pandas로 데이터 다루기" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 추천 사이트" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 통계 기초 ppt : Basic statistics: a survival guide - https://education.med.imperial.ac.uk/ext/intercalate11-12/statistics.ppt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# pandas로 데이터 다루기" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 기본 연산, 함수, 정렬, 중복색인\n", "* 기본 자료 특성 탐색 \n", "* 누락값 처리\n", "* 계층적 색인\n", "* 로우 색인과 컬럼 색인 교환\n", "* Panel 데이터\n", "* 데이터 입출력\n", "* 데이터 다듬기\n", "* 데이터 그룹 연산" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 기본 연산, 함수, 정렬, 중복 색인" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 산술연산\n", "* 함수 적용과 매핑\n", "* 데이터 정렬, 순위\n", "* 중복 색인" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 산술연산" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* pandas에서 중요한 기능은 색인이 다른 객체 간의 산술연산이다. \n", "* 객체를 더할 때 짝이 맞지 않는 색인이 있다면 결과에 두 색인이 통합된다" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from pandas import Series, DataFrame\n", "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 7.3\n", "c -2.5\n", "d 3.4\n", "e 1.5\n", "dtype: float64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])\n", "s1" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a -2.1\n", "c 3.6\n", "e -1.5\n", "f 4.0\n", "g 3.1\n", "dtype: float64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])\n", "s2" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 5.2\n", "c 1.1\n", "d NaN\n", "e 0.0\n", "f NaN\n", "g NaN\n", "dtype: float64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 더하기. 겹치는 색인이 없다면 데이터는 NA 값이 됨. \n", "s1 + s2" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bcd
Ohio 0 1 2
Texas 3 4 5
Colorado 6 7 8
\n", "
" ], "text/plain": [ " b c d\n", "Ohio 0 1 2\n", "Texas 3 4 5\n", "Colorado 6 7 8" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# DataFrame은 로우와 컬럼 모두에 전파됨\n", "df1 = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),\n", " index=['Ohio', 'Texas', 'Colorado'])\n", "df1" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bde
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
\n", "
" ], "text/plain": [ " b d e\n", "Utah 0 1 2\n", "Ohio 3 4 5\n", "Texas 6 7 8\n", "Oregon 9 10 11" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),\n", " index=['Utah', 'Ohio', 'Texas', 'Oregon'])\n", "df2" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bcde
ColoradoNaNNaNNaNNaN
Ohio 3NaN 6NaN
OregonNaNNaNNaNNaN
Texas 9NaN 12NaN
UtahNaNNaNNaNNaN
\n", "
" ], "text/plain": [ " b c d e\n", "Colorado NaN NaN NaN NaN\n", "Ohio 3 NaN 6 NaN\n", "Oregon NaN NaN NaN NaN\n", "Texas 9 NaN 12 NaN\n", "Utah NaN NaN NaN NaN" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 + df2" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bde
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
\n", "
" ], "text/plain": [ " b d e\n", "Utah 0 1 2\n", "Ohio 3 4 5\n", "Texas 6 7 8\n", "Oregon 9 10 11" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# DataFrame과 Series 간의 연산\n", "frame = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),\n", " index=['Utah', 'Ohio', 'Texas', 'Oregon'])\n", "frame" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "b 0\n", "d 1\n", "e 2\n", "Name: Utah, dtype: float64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "series = frame.ix[0]\n", "series" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bde
Utah 0 0 0
Ohio 3 3 3
Texas 6 6 6
Oregon 9 9 9
\n", "
" ], "text/plain": [ " b d e\n", "Utah 0 0 0\n", "Ohio 3 3 3\n", "Texas 6 6 6\n", "Oregon 9 9 9" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 기본적으로 DataFrame 과 Series 간의 산술연산은 \n", "# Series의 색인을 DataFrame의 칼럼에 맞추고 아래 로우로 전파\n", "# 즉 broadcasting\n", "frame - series" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "b 0\n", "e 1\n", "f 2\n", "dtype: int64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 만약 색인 값을 DataFrame의 칼럼이나 Series의 색인에서 찾을 수 없다면 \n", "# 그 객체는 형식을 맞추기 위해 재색인\n", "series2 = Series(range(3), index=['b', 'e', 'f'])\n", "series2" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bdef
Utah 0NaN 3NaN
Ohio 3NaN 6NaN
Texas 6NaN 9NaN
Oregon 9NaN 12NaN
\n", "
" ], "text/plain": [ " b d e f\n", "Utah 0 NaN 3 NaN\n", "Ohio 3 NaN 6 NaN\n", "Texas 6 NaN 9 NaN\n", "Oregon 9 NaN 12 NaN" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame + series2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 함수 적용과 매핑" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bde
Utah 1.230536 0.954398 0.779991
Ohio 0.786592-0.289047 0.694158
Texas 0.374949 0.655600 0.573002
Oregon-1.460670-1.101064-0.326258
\n", "
" ], "text/plain": [ " b d e\n", "Utah 1.230536 0.954398 0.779991\n", "Ohio 0.786592 -0.289047 0.694158\n", "Texas 0.374949 0.655600 0.573002\n", "Oregon -1.460670 -1.101064 -0.326258" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame = DataFrame(np.random.randn(4, 3), columns=list('bde'),\n", " index=['Utah', 'Ohio', 'Texas', 'Oregon'])\n", "frame" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bde
Utah 1.230536 0.954398 0.779991
Ohio 0.786592 0.289047 0.694158
Texas 0.374949 0.655600 0.573002
Oregon 1.460670 1.101064 0.326258
\n", "
" ], "text/plain": [ " b d e\n", "Utah 1.230536 0.954398 0.779991\n", "Ohio 0.786592 0.289047 0.694158\n", "Texas 0.374949 0.655600 0.573002\n", "Oregon 1.460670 1.101064 0.326258" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# NumPy의 유니버셜 함수(배열의 각 원소에 적용되는 메서드) 적용 가능\n", "np.abs(frame)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "b 2.691205\n", "d 2.055462\n", "e 1.106249\n", "dtype: float64" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# apply를 통해 각 로우나 칼럼의 1차원 배열에 함수를 적용\n", "f = lambda x: x.max() - x.min()\n", "\n", "frame.apply(f)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Utah 0.450544\n", "Ohio 1.075639\n", "Texas 0.280651\n", "Oregon 1.134412\n", "dtype: float64" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.apply(f, axis=1)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bde
min-1.460670-1.101064-0.326258
max 1.230536 0.954398 0.779991
\n", "
" ], "text/plain": [ " b d e\n", "min -1.460670 -1.101064 -0.326258\n", "max 1.230536 0.954398 0.779991" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# apply 메서드에 전달된 함수는 스칼라 값을 반환할 필요 없으며, \n", "# Series 또는 여러 값을 반환해도 된다.\n", "def f(x):\n", " return Series([x.min(), x.max()], index=['min', 'max'])\n", "\n", "frame.apply(f)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bde
Utah 1.23 0.95 0.78
Ohio 0.79 -0.29 0.69
Texas 0.37 0.66 0.57
Oregon -1.46 -1.10 -0.33
\n", "
" ], "text/plain": [ " b d e\n", "Utah 1.23 0.95 0.78\n", "Ohio 0.79 -0.29 0.69\n", "Texas 0.37 0.66 0.57\n", "Oregon -1.46 -1.10 -0.33" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 각 원소에 함수가 적용되게 하고 싶으면 applymap\n", "format = lambda x: '%.2f' % x\n", "\n", "frame.applymap(format)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 데이터 정렬, 순위" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 색인을 기준으로 정렬\n", "* 값을 기준으로 정렬\n", "* 특정 로우나 컬럼만 정렬" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 색인을 기준으로 정렬" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "d 0\n", "a 1\n", "b 2\n", "c 3\n", "dtype: int64" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj = Series(range(4), index=['d', 'a', 'b', 'c'])\n", "obj" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 1\n", "b 2\n", "c 3\n", "d 0\n", "dtype: int64" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj.sort_index()" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dabc
three 0 1 2 3
one 4 5 6 7
\n", "
" ], "text/plain": [ " d a b c\n", "three 0 1 2 3\n", "one 4 5 6 7" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],\n", " columns=['d', 'a', 'b', 'c'])\n", "frame" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dabc
one 4 5 6 7
three 0 1 2 3
\n", "
" ], "text/plain": [ " d a b c\n", "one 4 5 6 7\n", "three 0 1 2 3" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.sort_index()" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcd
three 1 2 3 0
one 5 6 7 4
\n", "
" ], "text/plain": [ " a b c d\n", "three 1 2 3 0\n", "one 5 6 7 4" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.sort_index(axis=1)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dcba
three 0 3 2 1
one 4 7 6 5
\n", "
" ], "text/plain": [ " d c b a\n", "three 0 3 2 1\n", "one 4 7 6 5" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.sort_index(axis=1, ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 값을 기준으로 정렬" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 4\n", "1 7\n", "2 -3\n", "3 2\n", "dtype: int64" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj = Series([4, 7, -3, 2])\n", "obj" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "2 -3\n", "3 2\n", "0 4\n", "1 7\n", "dtype: int64" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj.order()" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 4\n", "1 NaN\n", "2 7\n", "3 NaN\n", "4 -3\n", "5 2\n", "dtype: float64" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 정렬시 NaN은 가장 마지막에 위치\n", "obj = Series([4, np.nan, 7, np.nan, -3, 2])\n", "obj" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "4 -3\n", "5 2\n", "0 4\n", "2 7\n", "1 NaN\n", "3 NaN\n", "dtype: float64" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj.order()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 특정 로우나 컬럼만 정렬" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
0 0 4
1 1 7
2 0-3
3 1 2
\n", "
" ], "text/plain": [ " a b\n", "0 0 4\n", "1 1 7\n", "2 0 -3\n", "3 1 2" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame = DataFrame({'b': [4, 7, -3, 2],\n", " 'a': [0, 1, 0, 1]})\n", "frame" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
2 0-3
3 1 2
0 0 4
1 1 7
\n", "
" ], "text/plain": [ " a b\n", "2 0 -3\n", "3 1 2\n", "0 0 4\n", "1 1 7" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.sort_index(by='b')" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
0 0 4 1
1 1 7-2
2 0-3 0
3 1 2 1
\n", "
" ], "text/plain": [ " a b c\n", "0 0 4 1\n", "1 1 7 -2\n", "2 0 -3 0\n", "3 1 2 1" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 여러 개의 컬럼 정렬\n", "frame2 = DataFrame({'b': [4, 7, -3, 2],\n", " 'a': [0, 1, 0, 1],\n", " 'c': [1, -2, 0, 1]})\n", "frame2" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
2 0-3 0
0 0 4 1
1 1 7-2
3 1 2 1
\n", "
" ], "text/plain": [ " a b c\n", "2 0 -3 0\n", "0 0 4 1\n", "1 1 7 -2\n", "3 1 2 1" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame2.sort_index(by=['a', 'c'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 기본 자료 특성 탐색" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 자료특성 요약\n", "* 상관관계와 공분산\n", "* 유일 값, 카운트, 멤버십" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 자료특성 요약" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwo
a 1.40 NaN
b 7.10-4.5
c NaN NaN
d 0.75-1.3
\n", "
" ], "text/plain": [ " one two\n", "a 1.40 NaN\n", "b 7.10 -4.5\n", "c NaN NaN\n", "d 0.75 -1.3" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = DataFrame([[1.4, np.nan], [7.1, -4.5],\n", " [np.nan, np.nan], [0.75, -1.3]],\n", " index=['a', 'b', 'c', 'd'],\n", " columns=['one', 'two'])\n", "df" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwo
count 3.000000 2.000000
mean 3.083333-2.900000
std 3.493685 2.262742
min 0.750000-4.500000
25% 1.075000-3.700000
50% 1.400000-2.900000
75% 4.250000-2.100000
max 7.100000-1.300000
\n", "
" ], "text/plain": [ " one two\n", "count 3.000000 2.000000\n", "mean 3.083333 -2.900000\n", "std 3.493685 2.262742\n", "min 0.750000 -4.500000\n", "25% 1.075000 -3.700000\n", "50% 1.400000 -2.900000\n", "75% 4.250000 -2.100000\n", "max 7.100000 -1.300000" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "one 9.25\n", "two -5.80\n", "dtype: float64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sum()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 1.40\n", "b 2.60\n", "c NaN\n", "d -0.55\n", "dtype: float64" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sum(axis=1)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a NaN\n", "b 1.300\n", "c NaN\n", "d -0.275\n", "dtype: float64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.mean(axis=1, skipna=False)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "one b\n", "two d\n", "dtype: object" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.idxmax()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 a\n", "1 a\n", "2 b\n", "3 c\n", "4 a\n", "5 a\n", "6 b\n", "7 c\n", "8 a\n", "9 a\n", "10 b\n", "11 c\n", "12 a\n", "13 a\n", "14 b\n", "15 c\n", "dtype: object" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 수치 데이터가 아니면 빈도 관련 기술요약치를 반환\n", "obj = Series(['a', 'a', 'b', 'c'] * 4)\n", "obj" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "count 16\n", "unique 3\n", "top a\n", "freq 8\n", "dtype: object" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 상관관계와 공분산" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# 주식 데이터 가져오기\n", "import pandas.io.data as web\n", "\n", "all_data = {}\n", "for ticker in ['AAPL', 'IBM', 'MSFT']:\n", " all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2000', '1/1/2010')" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AAPLIBMMSFT
Date
2000-01-03 3.77 94.16 41.47
2000-01-04 3.45 90.97 40.07
2000-01-05 3.50 94.16 40.49
2000-01-06 3.20 92.54 39.14
2000-01-07 3.35 92.13 39.65
\n", "
" ], "text/plain": [ " AAPL IBM MSFT\n", "Date \n", "2000-01-03 3.77 94.16 41.47\n", "2000-01-04 3.45 90.97 40.07\n", "2000-01-05 3.50 94.16 40.49\n", "2000-01-06 3.20 92.54 39.14\n", "2000-01-07 3.35 92.13 39.65" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "price = DataFrame({tic: data['Adj Close']\n", " for tic, data in all_data.iteritems()})\n", "price.head()" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AAPLIBMMSFT
Date
2000-01-03 133949200 10347700 53228400
2000-01-04 128094400 8227800 54119000
2000-01-05 194580400 12733200 64059600
2000-01-06 191993200 7971900 54976600
2000-01-07 115183600 11856700 62013600
\n", "
" ], "text/plain": [ " AAPL IBM MSFT\n", "Date \n", "2000-01-03 133949200 10347700 53228400\n", "2000-01-04 128094400 8227800 54119000\n", "2000-01-05 194580400 12733200 64059600\n", "2000-01-06 191993200 7971900 54976600\n", "2000-01-07 115183600 11856700 62013600" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "volume = DataFrame({tic: data['Volume']\n", " for tic, data in all_data.iteritems()})\n", "volume.head()" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AAPLIBMMSFT
Date
2009-12-24 0.034521 0.004432 0.002600
2009-12-28 0.012070 0.013323 0.005187
2009-12-29-0.011575-0.003517 0.007372
2009-12-30 0.012065 0.005463-0.013904
2009-12-31-0.004208-0.012621-0.015584
\n", "
" ], "text/plain": [ " AAPL IBM MSFT\n", "Date \n", "2009-12-24 0.034521 0.004432 0.002600\n", "2009-12-28 0.012070 0.013323 0.005187\n", "2009-12-29 -0.011575 -0.003517 0.007372\n", "2009-12-30 0.012065 0.005463 -0.013904\n", "2009-12-31 -0.004208 -0.012621 -0.015584" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 각 주식의 퍼센트 변화율\n", "returns = price.pct_change()\n", "returns.tail()" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": [ "iVBORw0KGgoAAAANSUhEUgAAAZcAAAEPCAYAAACOU4kjAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\n", "AAALEgAACxIB0t1+/AAAIABJREFUeJztnX94HdV55z+vbAQyFrZlG9uAgUYJUQguiKaps86u1Bbb\n", "JN11Au6G0KarJinQbovBFomhUHAbsSENhpSkWxbKDzdpfj3Nksc0qYRhkRvSNCktGIeEJKSB8jsx\n", "JC00SgXRu3+cGd25c+dKV/fO/SHp+3meeTQz95yZc0fS+c553/e8x9wdIYQQIk/amt0AIYQQcw+J\n", "ixBCiNyRuAghhMgdiYsQQojckbgIIYTIHYmLEEKI3GmquJjZmWb2iJl9x8x2ZnzeY2ZfMbOfmNlg\n", "6rPHzOwhM3vAzL7WuFYLIYSYjoXNurGZLQA+BpwBPAX8g5ntdfdvJoo9D1wIvD3jEg70u/sLdW+s\n", "EEKIGdHMkcsbgUfd/TF3fxn4NPC2ZAF3/4G73w+8XOYaVuc2CiGEqIJmisuxwBOJ4yejc5XiwN1m\n", "dr+ZnZdry4QQQtRE08xiBHGohQ3u/oyZrQT2mdkj7v6lPBomhBCiNpopLk8BaxPHawmjl4pw92ei\n", "nz8wszsIZrYicTEzJU4TQogqcPea3A7NNIvdD7zGzE40s3bgHGBvmbJFX9LMFplZZ7R/JLAJOJhV\n", "0d1barvqqqua3obZ0KZWbZfapDbNh3blQdNGLu7+ipn9HjACLABucfdvmtkF0ef/x8xWA/8AHAVM\n", "mNlFwMnA0cD/NTMI3+Ev3f2uZnwPIYQQpTTTLIa7/w3wN6lz/yex/yzFprOYl4DT6ts6IYQQ1aIZ\n", "+g2mv7+/2U0ooRXbBK3ZLrWpMtSmymnVdtWK5WVfa0XMzOfy9xNCiHpgZvgsdugLIYSYo0hchBBC\n", "5I7ERQghRO5IXIQQQuSOxEUIIUTuSFyEEELkjsRFCCFE7khchBBC5I7ERQghRO5IXIQQQuSOxEWI\n", "WcjIyAibNm1l06atjIyMNLs5QpSg3GJCzDJGRkY466wBxsY+BEBHx07uuGMPmzdvbnLLxFwhj9xi\n", "EhchZhmbNm1l374twEB0Zg8bN+7lrrs+18xmiTmEElcKMW85CGyNtsxFWFsCme/mLxIXIaqkWR1n\n", "X9/pwM3Almi7OTrXWsTmu337trBv3xbOOmtAAjOPaOpKlELMVtJ+j/vuG2iY32P//n8CbqBgFoP9\n", "+/dy+eV1v/WM2L37puj5hHaOjYVz8g3NDzRyEaIKijvOIDK7d9/UtPYcOvS8zE+ipdDIRYhZxuDg\n", "+dx33wBjY+G4vf19PPzwy4yPfwRo7ChqKtLt7OjYyeDgnqa2STQORYsJUQXNDgceGRmZHCkdOvQc\n", "DzxwHq0YPZZs5+Dg+U0XPFEZCkWeBomLqCet0nEqNFnkjcRlGiQuYj7Q7FGUmHvM+nkuZnammT1i\n", "Zt8xs50Zn/eY2VfM7CdmNjiTukLMFzZv3swdd4TRysaNeyUsoiVo2sjFzBYA3wLOAJ4C/gE4192/\n", "mSizEjgBeDvwQ3ffXWndqJxGLkIIMUNm+8jljcCj7v6Yu78MfBp4W7KAu//A3e8HXp5pXSGEEM2j\n", "meJyLPBE4vjJ6Fy96wohhKgzzZznUou9quK6u3btmtzv7++nv7+/htsKIaB1IuVEPoyOjjI6Oprr\n", "NZvpc1kP7HL3M6Pjy4AJd/9QRtmrgJcSPpeK6srnIkT+KDpt7jPbfS73A68xsxPNrB04B9hbpmz6\n", "S86krhAiR1ot9Y1oTZomLu7+CvB7wAjwDeAz7v5NM7vAzC4AMLPVZvYEsB24wsz+xcwWl6vbnG8i\n", "xOxCafBFI9AkSiHmEXmYtGQWm/tohv40SFyEKCavVDFy6M9t8hAXZUUWQsyYzZs3S1DElEhchJhH\n", "KA2+aBQyiwkxz5BJS0yHfC7TIHERQoiZM9vnuQgh6kirhhy3artEvmjkIkSDaKQ5qlXDhVu1XaKY\n", "PEYuuPuc3cLXE6IxDA8P+8aNZ/vGjWf78PBwyWcdHascbne43Ts6VpWUyZONG8+O7uXRdrtv3Hh2\n", "3e4329slion6zpr6X0WLCZED6Tfy++4bKHojL06ZAmNj4Zze2MVcRT4XIXKgXvm2qvVPDA6eT0fH\n", "TmAPsCcKOT6/5vbUSqu2S+SPxEWIBlBNpxqPhvbt28K+fVs466yBigUmvfTx5ZdfyO7dNzXdia4l\n", "mecRtdrVWnlDPhfRICrxqUzlk8kiL/9Eo/09YvaDfC5CtAbxG3khGqz0jbxZKVPk7xHNQOIiRE7k\n", "LR6tnqpFM/3FVMjnIkSDmKlzvhL/RCXX7Os7HdhG7O+BbdG56toV16nWHyTmCbXa1Vp5Qz4X0SLU\n", "w+9R6TWD72bQ4exoG5z03VTbLs1XmduQg89FIxchciZrJFCPUOWZXXMd8LloW1flNYSoHPlchKiC\n", "cv6GcpMpm0k9fDet7g8SLUCtQ59W3pBZTNSBqUxJ5cxFjTCLtbUt86GhobJls8Kga2nXTEOrxeyB\n", "HMxiTReAem4SF1EPpvI3TPVZPTrjoaEhb2tb7rDeYbAq0ZJIiDR5iIvMYmLek2dI7VTmompDldPt\n", "AyaPDx16jomJ3RTmsKyb8RwWLVks6kKt6tTKGxq5iGmYziyU9VZfTZ282tfevtTb21cWmcJCJFjp\n", "SEmIakFmMYmLqI3pzFjlRGQqH0byfK1CU9q+9SXtDWYxpXYR+ZGHuDTVLGZmZwIfARYAf+7uH8oo\n", "cwPwFuDHwG+6+wPR+ceAfwN+Crzs7m9sVLtFZVx99dVcd91tAOzY8W4uv/zyJrdoZkyVNiXLlJSO\n", "FNu//zeAlxkf/whQmoY/L0499RRWrNgLMGmC27Rpa3SsmfOiSdSqTtVuBEF5FDgROAx4EHhdqsxb\n", "gS9G+78A/H3is+8BXdPcIwcNF9UwNDTkcNTkGzUcVTaSqZlUE/lVjqzyYaRRWf1K2pc2i7W3r/Te\n", "3g1FI6V0BFlv7waNZhrEXAmOYDabxYA3AcOJ40uBS1NlbgTOSRw/Aqzygrgsn+YeeTxnUQVdXd0l\n", "HW1XV3ezm5VJuQ5hppFY9RCXrPYNDQ15V1e3d3au9YULlxQJY3f3aZltkLms/syl7NOzXVx+Fbg5\n", "cfwu4KOpMncC/ylxfDdwerT/z8ADwP3AeWXukc+TFjNmtohLZfM/Bh2WeUfH6ilHX6WjjJXe3r50\n", "ys5mpm+6xYJ3SuoZh3aWisvZcvQ3gLmUEicPcWmmz8UrLGdlzr/Z3Z82s5XAPjN7xN2/lC60a9eu\n", "yf3+/n76+/tn2k5RBTt2vJsrrtiWOLONHTve37T2QHZIb7mliQv+ltXATuB6xsbgiiu2cdttn+FP\n", "//TDmSn1i9PufxygbBr+q6++miuvvJ6JidcAx3LPPb/Oqaeewgc/eFmmn2RkZIQrr9zNxMT10ZnB\n", "5KfAZ4BVwPsS57cDnwKeneHTKkVZkOcuo6OjjI6O5nvRWtWp2g1YT7FZ7DJgZ6rMjcA7E8eTZrFU\n", "uauAwYzztUu4qJrYfNPV1d10f0uWyaK3t6/sm2bhLTTb1FWryWN4eDgKI45HRkf5dCOcMBpc7zCc\n", "GKksjX6uKBplwfEOPdFWu4mmkSaf2eq3kFmsdcxiC4HvEhz67Uzv0F9P5NAHFgGd0f6RwJeBTRn3\n", "yO1hi9lNlskiy3QXd2i9vX2R+amnrJkpdqRP1QmW6yiL21M+ZczGjWd7b++GIic+rIoE5nY36/TO\n", "zuMjUdmQMIuFIIo1a47PpZNulMlntnfQs1UY08xqcQnt5y3AtwhRY5dF5y4ALkiU+Vj0+QEK/pZX\n", "RWL0IPD1uG7G9XN72GJ2k9U59vZuKOnIhoaGis6FEcWijI59MDHyKD/ayLr+xo1np4Qtq219Ud1B\n", "h+NSI5YweopzifX2bohGLqVzYMyW5dLJNUpc5pLfYjYz68Wl3pvERcSUeyNOv2lmmcpiZ34YFQx6\n", "8cTF8p1g1joq0BX9TJrCis1i8YgDtnowe5WOWLq6uidFo9DmLBPecbl0zo0aUUhcWoM8xEW5xcS8\n", "YKo17pPp8g8c+HpG7dfy5jcfQ1/f6Vx33W2Mj3+S8fGFjI/fSHD4h/r//M//XDR58dCh54C/Ba6N\n", "rnMJcDTwCcKKkNDWdhETEwuAZcBtUbljeOaZ54B7CHOMBwgO+5+J9v+dE074Oe6//352776Jxx9/\n", "Mqp3PvGEz8L9VlT1vNJM9fzyRKn85xC1qlMrb2jkMm/Iw9ZdGGkkzWJhtNLb2xf5PZLO89uj/UGH\n", "JQ5HFr3ZZ885SY4yBt1sSeJaK1Mjla7o53CqTfE9jyozCuryEKZ8lLe3L60o71kr+QpaqS3zFWQW\n", "k7iIyk02yU4r9n0kO7CCSWY46vzXO/R4R8cqX7PmxOi429OJImG1F3weBb9IsV9lOFV/vZtlzUlJ\n", "TrqMI79KfSkFE9jZk2W7urq9u3udd3Ss8YULj/bu7pMrWrtltjvRRf5IXCQuwiuz05dOiiwN/U13\n", "ssH38SofGBhIdPJx+O8Gj/0fhRFGPGroi0Y7G8qMdo7ytrbDE20oJy63e4hWO64CcSkNUEinhin3\n", "nLKDHfqa88sULUEe4iKfi8iVRk60i+/1j/94ANhS9Nk//uMBTj+9H3iFFStWcejQ84kklFuBG0gn\n", "pLzrrs/xjnecyZ49FwM9wEaeeWaYPXs+D/xJdOWdwHsJ0e/vIuRT/R0Kky3fHX12C8uXv4GJiR8T\n", "/CuvJumfmZjYHtW9JNHqbcB/EDIjQQiE/BlCbtaLEuUuidq+DTgPuIS2ttt59NEjixJtjo/DAw/c\n", "CGzhvvsG6OnpqfjZHjjwdUZGRjRRUlSNxKVJzMXZzuXWj6/Hdyu+188QOtqYbbzwwnm88MI6YAew\n", "HrPRiq575533EZzoTwLXExzt51DsKL8ROAbYANxKEJO7CGLzCSB8/7vvvoiQnzV26A8QO/LhtdHP\n", "/wp8gCAqY0AH8NvRZzuADxME6RLCbPtjgRW0td3Ob/zGWTz00N9z4MA3mJi4nhdfvDHjGx0DxA7y\n", "m+no2JnpLL/nnnOZmIjr7GRi4jdnvOiYEEXUOvRp5Y0WNYvNVRt3I8NIS+81GCVzjCcUJk1IfQlz\n", "VqlZrL195eTzD36SZI6u2IGenGOy3GGoxNRVmusry8y13oNzfjAyeaXn1MRtj300K6Jyi0o+6+rq\n", "jua4JP066WCDQruT5rG0szxcZ31kZhuu6+9OtD7I5zI7xWWuxvJX+r3yi+wqvVfWrPvgRI/9F2d7\n", "8JcsjjrT9Q6LfPHiNT40NBQtFbA04xqxU/0ohxM8e+Z+uXrJ4y4P81e6PESYZZUfjq4Vty+eyDno\n", "6cix0pUoByPR6Zs2aWb6d1LrC4+ivOYOEheJS0tRSQeV16it3HUKs9WTb+893t6+0hcuPDLqrJMO\n", "8uHEuUVRPrQTMjr9Li/k8UqPMjzaP9JLRzPp2f2DkdgNe/ZIZ1kkXOnvcIKHEVMscnFE2ykJoRws\n", "ygLQ29tX5NCv5JlWKw5zdTQ+X5G4tLC4TPWPOpf/EafroPIU1qx7DQ0NuVlhVGK22Ht7N/jQ0FAi\n", "P1eygy6eP9LRscKhw9Mz5sMoI21uS85JWerFnf76SGxiQTvbgyltvcNaD6Onnoz7HJ4QlrS4rYlE\n", "MD0XJ8x7aWtb5gMDAyWLhTUiaehcfWGar0hcWlRcKn2Dn48mhHp2QoXnHuaRtLUtn+xYi+8b+yay\n", "5o8sT4ww+ryQriXL93JKJBoneiEcOXmt9V7wq5SbfLnYCyawuEzW5MtVDl2R+JVfsyXLLNjWtnzy\n", "b6xef3cSl7lFHuKiaLE6MNXa6zFZa7DPB+qZ3iM893cRFik9homJDezf/09cfjkcOvR8ouRmYACz\n", "2wjvIAUWLJjgpz8dj46+A5wEbIyOXw38FiF67FbgPcAh4AVCSPDFiSv9LtAJvAL8CvBZQtTYQKLM\n", "XkJe1huBr0Tn1gE3p64VhxyvY2wsGbpcGRMTr5mMTKxXNJ/Stog0EhfRMOLw656eVwO3sWLF8lxz\n", "VIVcXqPAddGZHRw69DquvvpqHnzwIUJo753Aw8D3WbBgglde2Z64wvZIWE4jdPA3ROffCRxGCAuG\n", "0PEfA/w5IdQ4njOzLLpHW3T+mqj8NmAcOEjIEXYT8HRUBkLY81bgdOCvo+PXE+a2HBbdayPxHBm4\n", "hra27YnQ4TDvpaNjJzt2XMiVV24vCisOIdLfq+ilp1oalXtMzCJqHfq08kYLm8XmE8XrowzW7Zlk\n", "5fJqb+9ys6Rv5KjIFJV0zJ/kcLTDER58JGmzU5b5LL7m6tRnJ5cxW4WAgYL5LPajxFFg6czIi7wQ\n", "TRabxeKMAOsnHfVZTvvipZAHJ5+1TFeiUpDPpTXFxX3++lTSlKZUKXSS03VsM32G2WHIWalT1ntI\n", "EjnsISw4jvyK/TBpMckSl9UefDLJNC7xtbLKry1z/pRoP93xl5szs2IySGG60OJ0HrX0omPz/aVH\n", "lEfi0sLiIgJZb8txRxqLS5aIVDP6K4QhD0Yd8XIPo5Ks+w96WAo4Ln926rNFkTCtdjgsOi7Miyl2\n", "8g94CC+ORyVDXjoJ88QyQpclLsOeHSDQ5WFkVfnoL/0c29uXem9v37x/6RFTI3GRuLQ82eJSWIO+\n", "nIhUY8IJEyDTYcRpU9QqL51dv8ILqfCHow48fY2kaW2xw7pIUE7xEFbsXhwmHAvcUg/mtmVR+eR1\n", "l0bC1ROJ2JJEvXS48RIvjIyGPDlhciqhkClMVEMe4iKHvqgr6SiitrbtnHrqyXzwg8Hhu2nT1kwn\n", "80wZGRnh6qs/CpxKyM01kPj0ZuAPgH8nJJb8a0ojt64iOMavJUR5/TohmgvCqtpxssgRgsP+GQo5\n", "w7YBa4EfALdQWOCLqN4thDxlAIMER/9LhASYfwZ8n0IQwjagPfoOeyg4/9cA9xGixm4D3gDs4YUX\n", "rmXfvvrmcROiGiQuoq6URhF9qqIOMCu0ta/vwqKVHpMrSP7ar/0uY2M/AxyecbVnCQLxVUKSyR9k\n", "lPk3Qvbhi4EJQsd+LSHCa5QQLrya0NmfTKmAbScknXwlqhPzZYrFhui6hxOiy7qAodTngwSRuYGQ\n", "7XkQ+Mvoe9xISHK5i6RAlov8Uohw6zAXk9VOSa1Dn1bekFms5ckyiyUd0LF/YGhoqKKFrkqjrGJn\n", "fXJ/2It9IukyscksPYO/KzrOcszH51Z4sU8my3cSrwcTO/9jn0/s7+n2Uh/POo9XvDzssMXe2bk2\n", "dd1B7+w8vmxGiFrSukxXV8Er0zPbIkiRz0XiMhdIRzZV6oOJhae08+6JOvB4oa5CBxxE5bhIgI7z\n", "EIK8wYPPZa0X0rUUAg+Kr73Cg88l6cdJz9xf78GHsjQShqR/J25THNk25KUpYDrKiFdID9PWtsS7\n", "u0/2QjDCKZ7MexanfKm1029krri5zmzzfeUhLjKLiRlRj6F9MlvBTHwwBw58nVNPPSXjk5OBSwn+\n", "jnVxywmmrtj3sY0w0/4x4G6Cn2Ux8K+ENVZ2Aiszrr2CYJ76SVTmp1Fb089hBbAEeBz4EXAF0A18\n", "PKr/8ajcP5FcuCxwRcZ9jyGY4q5lYmIx3/vet6LveWviO+0E9jAxcT1/8AfbOeywhYyPh4mf1fhk\n", "Kpl0Wc+JmWJ2I3ERFZNeDGz//t/g9a8/iRUrVtXdhjw4eD533/2ruMcLYh1kYmIFBw4cJPg7DhKE\n", "JHSwoQM/ksIiYjdS6sS/mTBbP15lMk6t8jfRtf6O4pUiLybMmP9IonxfdJ11iXNjgAO/Twge+AWC\n", "3+Y/gH0EJ/4awmz+o0mvoglLo+8Rk/xOTwOvY2Li+qj+9anvdBOwBfcjGR8v+HLU6TeXeen7qnXo\n", "U8sGnAk8QkjitLNMmRuizw8AvTOsm88YUbh7+bDiPM0h5cwsw8PDiZn28TyUpLlpiSfngBTS3ccT\n", "I6da5yX9fVZ7wSczHJXrcTgmo3zsJ+nyMK+mx4N/JL5WZ6qd6TDnJV6clj8+LjV3FSfUnGqCaJy2\n", "vzYzjMxi+TKbfFPMZp8LIbHSo8CJhNfBB4HXpcq8FfhitP8LwN9XWtclLrlTfkJkdZ1XObL+CYvv\n", "Hae0z5qQuCz62RPtr4i2rV48V+UoDz6WrM55aer6sV8ka9b82al6izz4ZOI1WdLpYcqlkkk677cm\n", "rtfl6RUiCytUDnlxwEEsKj3e3r60qtn46Wcvh/78JA9xaaZZ7I3Ao+7+GICZfRp4G/DNRJktRIuO\n", "u/tXzWypma0mLJo+XV2RM+mhfTABfSL3+1SfMXoxYY35J4EegtX3W8Ai4P8Bv0wwj32b4P/4FsG0\n", "FHMJ8CLBnPYYcCHB3PZlwgD6jymY2Yj2N1JstrqYMH/ly4TszB0Vtv27wMvA71CYP7MH+J+EP+vf\n", "jq5/EYsXL+KllwaAywnzXXZF3/k9tLXdGs0jCqa7mSSSTJs9Yz/NXXd9bsp68zXDt5iaZorLscAT\n", "ieMnCaOT6cocS/BuTldX5ExyzsqhQ8/z8MOvMD7+LLBnynkoeVAsbAY8RLE/ZBvBp3E4hUmIB4F/\n", "oeBTuZgwyfIkgjX1fxPmrnyQ0LlPEIQontB4CcE/EgvEfxAmMcaTK88jCMAnCA79PQRR+zLhT/RJ\n", "QtBA0n/yECHbccxFhLkxvww8RcF3Q9T+doJY3Qg8wsDA2zn33HMjEQhl29u/G/m+vpc5j2j37psm\n", "RWaq34mc8yJPmikuXmE5q+Umu3btmtzv7++nv7+/lsvNe5JvqcnIsb6+C7n66o9OuVZILZFmsbBd\n", "dtkHOHDgG0xMHB998gHgOEJHfwth9BI7srdSiMQaIfy5v4HQ+cc56TdH2yWECK4/pthBfiPwdULQ\n", "wLGEzj8eWVxCiBbbRXDUf4KQ3v726NzXovLJmfbHE9Z3uZjwp/1bUZnthIF6cmSUnu2/h4ceuo3b\n", "b09PTP145rMsNxKRWIg0o6OjjI6O5nvRWu1q1W7AemA4cXwZKcc84T/7nYnjR4BVldR1+VwaynRx\n", "/DNdnTOeSJm24xfuUy64oC/lm0nuJ/N1xZMY4wCBctmMY1/IUodXZdTJyv21wbOd91nzYZL3WuvF\n", "EypLfTzJVSVr/Z2kkXNexDDLfS73A68xsxMJr3TnAOemyuwFfg/4tJmtB37k7s+Z2fMV1BUtxHQm\n", "l/Rb9r59hdUXs9+4z6d4hBGX30j4c4DgmotHAk9HW9yGvYR3lL2EP8UbCGas5MjhEoIf5DMEf8fe\n", "6JrvJ5ir0vNTdhLMYY9HZT9HCFO+hDBSejel82GSvETw9cQ+jksIIxoIJrJbmZg4lssu+0BdRh9a\n", "8EvkSq3qVMsGvIXgVX0UuCw6dwFwQaLMx6LPDwCnT1U34/o56biYjuneegtv0cMeR3v19m7I+Dz5\n", "Jr+hpOzw8HAiCioOAT7Nw7oqi7wQvtsT1R3yQnRVMhVLsj3LE6OhONQ3Dj/e4MUjjThKa7mXtndp\n", "NHpJp9tf7HCCF4ckx+nzk+Xi0VAynHogakMy0m2JDw0N1fw7EaIczOZQ5EZsEpfGMlVIahCFpUUd\n", "bHv7yjKhxu6FVC2hrNnSKNfYBl+4cIkXm6tOjjridBqVZCcdm63iTjrePy7aL5czrMcLc0diAYvr\n", "J+8Xz0Hp8pBif2203xOVjevG7T4qOhffN2ku604IXdbiZeXNY/Gqn11d3d7bu6GseVGIqZC4SFxm\n", "Fb29fSWdZOwDCEvzJt/4l5aULe1oh6PjIz2MJJKd9NboXJw7LBaJDi8sL5zs6IcyRgiLMgQrHgGd\n", "FglTeg5KPNflyEg44uWTY7GL5+IsisQlK7FlYe5QV1e3d3Yen/ks0v6T4lFdGA21ty+VqIgZk4e4\n", "KP2LaBgrViwvOXfo0POTa7FMTLwHuJG2tu+watXRPPNMuvQxif0Rgm9lDSEK7DcJUWDnEsKRf0zw\n", "iUDwoxjBp3IEIeXKIUK4MhTm61xKmFeyjbBmy+GUplfZS4jq2ksIH76FwhyUeC2WXyZEj30oqnMx\n", "haivawnRY1cQotqepOBXidsahzhvY8eO9/OGN7yBt771XCbiADd2EqLSvlf0dHbvvinKJVZo7/j4\n", "jU0NJ553aeZFgVrVqZU35tnIpdVnSpd7sw7LExe/mff2bkil0o8jrYaj/RO8kNolvWpjnDIl+aZ/\n", "vBfSvWSlfenykLH4CC+YvZZllEvOyD/Kg0luZVR/qxcix5L3zzK3JVO3xCOwOEtzIf1+8cguHp0N\n", "ZvpPyqXnqTRzQt5/P/L5zF6QWUziEjNb/pGDkPREHXyfx0v1ZpnL4s6ut7cv8teE77Zw4ZEJc1K5\n", "VDDpMN/jEmLRl1E+ThsTO/ljh37STJZ0uscpZdJ+l+GM+xf7jxYsWBaJWNrBH/uNCv6XdDj3VD6t\n", "3t4NifxrMzOL1ePvZ7almRcF6iouhOXvdkQ/09uOWm/ciG0+iUvWP3JXV3fLCUwQl+L167u71yU6\n", "tkE3W+adncd7b++GyfYXhGZDNPpJCkFW5NbyxD2O8rBWfZxEMna8x58v8xCVtcwLo6B4RBQLWZyv\n", "LB6hlBO1sxPXTIrSVo8XG4sd7WadUTtXefD3JIVpZvnAgviuj9q5uOT5TUc9hEDiMnupt7hMAA8A\n", "/4uwwHjRVuuNG7HNd3GB9S03gsly6vf29pV98zZbXNRJFoc0xya29AgimahypYdRS1ckMEd4IYor\n", "Dls+PHEcC0y8CuWgF5JgdkfHaYEqfubh8yO8OCw6jmwrmLqGh4czR23lVpTMYnh42BcvXuNpwU6G\n", "eSfLlhv51EMIZstoWpRSb3E5jeCRfJCwItFGoK3WGzZym0/iUrrcb8FE0yybexZTdWLFnyX9ED2p\n", "FSnjWewbog5/qQdfRTxDf0OibtoXc1TJ/eEkL14aOenfidP1b/WCKa/HQ7bjk714hLLEgy8oTpmf\n", "/B7LHLaWdLC1dMCFuqWp97u6usuUzb5PvYSg1f2AIpuG+FwIYTb/CfgoIT3rllpv2qhtPomLe/JN\n", "uHjeRCXi0qi3zKnuE0xmsUAkRzBhNNHdfbIfcUTsm0g774vn0ITjrBT5WSHOWY772OQVj1LSqVti\n", "s9tg4l5iBrX4AAAZHklEQVSxPyZrmeLjypopq+2AC2LcV3K/3t6+MmVLRb3Wdoi5R6PE5Wjgd4H9\n", "hPjKN9V600Zt801c3KsXiUbax7M6saxIsjD3JPZtrPHS/F/D0Vv7ai+Yso6PygyVEY3FnjYhFY7T\n", "ghMv3LXEs81u8QJh6ci0JRn36PGuru5cO+5sE2Hx5NTSsvX//YrZT73NYu8lTCYYJeT3WlXrzRq9\n", "zUdxca/uDbTZnU+2zyjtlI/NYbenxCNpyop9GMMZnfxyL4xECuG+wcyVNIslZ9vf7vECXMH01u2F\n", "tDSDnh3WvNpLJ2nGprJ6rdo56G1tyyf9V1OXlf9DTE0jHPoPAXdmbHtrvXEjtvkqLtXQSLNYMvNx\n", "b+8G7+rq9o6OrCWE0+HEsbkvXqExq3zsvI/9EIVcZmHUEpu4kuazw6NzKxPbcdEW+17ilSi7JkUi\n", "nNvqWaOUsPVF4hMLWrFwT/cSkPcqkDJ7iUqpt7j0R1tfYn/yXK03bsQmcZkZ9e580m/ahTXu4+OC\n", "6cksPREx6S85xbPDgNd6YQJjnJolK8V9MlggjjRLi04sHKs8jJKSoctLPYxetnoYpRyZaFNylBK3\n", "rbStvb19TXGwC1EJmkQpcZlVFJu+suaJxGam9d7VdaybxevHD3p7+0o/44wzfOHCo70w1ySd9+sE\n", "L52tn842nBSj2DR2u2evDxObvEqjsQrluz3MkYlHTIf7GWecUSQM6fXs29qWeXf3ySXXTJohm22m\n", "FPObPMSlbG4xMzsJ+H3gBcK6rzcD/4WQ4v633P0fytUVc5v65YtaR8iXtYUXXojzcQFczCuvjPG3\n", "f3uAV175MGFtk1vIXnL4yxTWbIn5A8Lyxbcl7hPnE7upgnb9OOPc14D/B7we+ALJ1Smff36Myy+/\n", "kOuu+wAAO3aE5ZivvHKQiYmVTEwcy3e/+xTBwjxQcmUh5gTlVIfwX3o+8D7CKkvvICwmvhH4aq2q\n", "1ogNjVxypxZzTXF+rK1ebBZLjjLiiYyeMGGl/RZZYcaLPTvUOG3SOsELqfOTo5oss9hRHiZELkl9\n", "NpjaL9yvs3NtyTMKk0fTo6rCddIRXjKLiWZCnX0uDyb2Hy33WStvEpf8qdZcUzrJc4kHv8UGD+HD\n", "y7ywwFecyHLYSydBxmlS4sSScZqW2CyWXqwryxy2OlEudv6vd1jnwW+zOqq3zONklGaLo+PjMq5X\n", "HHiQlSK/MP8oLXxHO6zwjo7VJb4uOeBFs8hDXKZKue+J/Ren+EyIaUkvcwxgth339xAWFH0Pway1\n", "hWD22gacTKmJ60bCIBoKKfV/j5CO/wXgROBTBHPX04Sgx3Wp1iwk/An/LGHp4dXRPeIU+TsJZrYR\n", "4K+AIAULFgzy05/+OON6j0Rth/b29/HqV5/EAw8UlzjhhNX86EePJNLmxxwNPMvY2DXs20fRks7x\n", "JsRsZCpx6TGzg9F+d2IfoLuObRItTF/f6dxzz/bJTrKjYyeDg3umrXfo0PMl59yNxYs/yUsv9QGf\n", "J/hFrqCr63B27Hg/V1xxXcaV4jVdfpsgCCOENVp+Pzq/Lfq5hSASmwnrqcTsAMajz2+Mzm0miMMu\n", "wvoq7yK4GM8ruvOiRUfy4os/n7reJQShuoaurpf55Cc/DsBZZw0wNhZKdHTs5IMf3MP999/PlVdu\n", "TwjMJcBro59BQMfG4LLLPqg1UMTsp9yQhvAKWHardcjUiA2ZxXKlYNoKy++2tS2vaC139zi1SzKl\n", "S7hOR0dsgiqYsbq7T3Z3z0jIGOdLixNBnu3Z6fOP8+LVIRd78YTGeHb9Cal7r/RCXrLizzo6VvnQ\n", "0FAU9ZW8//DkPZPpXcqZtOLzwXSWnBRaaH/wS8nXIpoHCkWWuDSCuEMMfoPSyYCVEHw1pZ3yggUr\n", "SzrXzs7j3T0EAATHe3LyYvF8mOzJlMWdc7yOfHf3aQmhiee/DHpn59qoQ4+d9LGIbfUFC1Z6V1f3\n", "pIiWy2RcTZr8glAXBDQs9VzdM07/vuSrEdVSV3EBXiL4WrK2f6v1xo3YJC61U+qIL07gOJOMy2HN\n", "kUJH2t6+MhqdlDq/Y4aGhryrq9vb2pZ4IUNxem5MOnXLkIe5MsWJIoPAlTrV4044zKGJhWe4qK1J\n", "0cjzmcRr1PT29k3u1zK/RVFmIg80cpG41J1y68SUm1U+XTqTON1Lb2+fDw0NRZMJi81iaVNb8Wgh\n", "qz1xqpXC6COrbeEa8RorhZFCcVbmFUXfsVwnn8dorlwCz1rEQZMvRR5IXCQudafcCpdZvoSZdIrF\n", "ZqFuD2G+i/2II7qKxKXUfJT8Ga+TEq97P+hmXd7ZubZogbEwvyYZnhwWC4t9RsnllMMSynFqmOk7\n", "6bCiZEGssjIST/39s0dG1Zq1JC4iD2atuABdhPT93wbuApaWKXcmIc7zO8DOxPldhLCeB6LtzDL1\n", "83rW85ZKRWOmnVph0a/0PJYeh0WTAlN83SGH49xsUZR7rNChx6alYHortHVgYMCzUu/HJrP091u4\n", "cIl3dq71zs61vnDhkim/d8HUFy8FsN4XLlzS1CUOZBYTeZCHuLTlEnI2cy4F9rn7ScA90XERZrYA\n", "+BhBYE4GzjWz10UfO3Cdu/dG23CD2j0rGBkZYdOmrWzatJWRkZGarrV582buuGMPGzfuZePGvZNz\n", "MPJpSzJVywAh3ctSYBHXXPOniXIHCflSrwfOwf1ncf+TyXrj4x8GFrJixXLGxz8yeX5s7F3s2bOX\n", "EO5bzM/93Kls3rw5Nf9mNa+84rz44rG8+OKxAPT23lz2e+/efRPj4z2E+TZfAb7CK6/8yWQYcTOo\n", "9fclRG7Uqk7VbITRyKpofzXwSEaZNwHDieNLgUuj/auAwQruU6uAzzqa9eaadd8Qurt08q3ebPGk\n", "uWp4eDgRclt4e499KgsXHu3uHo08kj6ZVZ4VfhzWMkk7w+Pw4+IRUtLPUjyCSPpcgqM+ay36mKkC\n", "BKp5XjNZ3ljRYKKeMIvNYj9M7FvyOHH+V4GbE8fvAj7qBXF5DDhAyGBYzqyWz5OeRTTT5p7u9Iod\n", "5HFUVc9kR1rqC4lDgG/37u7TphCg9HVDmpZ0GvtCeHFhbg4sK+PTie9fPnIt6/tmRcDVUyRk9hKN\n", "oKXFheBTOZixbUmLCfBCRv2tU4jL0ZEoGTAE3FKmDX7VVVdNbvfee2+Oj781aRWHbiG0Ny0M3UVt\n", "CiLU58lFuBYuXD7Z8Wbn41rvwYlfSMkfjzLi0OXOzrW+Zs2rorT9W6ORU1fmpM+CEJb6ZtJr0Zer\n", "G0fA1bujb5Xfr5hb3HvvvUV9ZUuLy5Q3DWax1dH+mjJmsfUps9hlJJz6ifMnAgfL3CeP5z6raIU3\n", "20IbsjIX9zkMlkScZb3FZzv9j/LFi9f4mjWvKrl2d/e6jPkngx4W81rlZsvKZhTIupfZ0pYbFUhc\n", "RCOYzeLyx7FQRL6UazLKLAS+G4lHO/Ag8LroszWJctuBT5a5T06PenbRKJt8ufsUOsBhDylVkiHA\n", "xYt8TSV+xWHI8fyUQY/9Jun5Jdkz5/uKzFblBKO4zWHENJW/ZbrnUK/fQSu8PIi5z2wWly7gblKh\n", "yISshF9IlHsLIWXuo8BlifN/ATwU+Vw+HwcHZNwnv6ctigj+hoJwJH0NxW/XYT2Wzs7jJ81HM3nz\n", "Lp6sOHUOrmBeyzbDTWfqSnfabW3lRzlT1YsDGeopAHLoi3oza8WlUZvEpX5kdeRxpz3V23W1Zp2s\n", "ep2da4t8HdlpWXpK6pVz0hcvZjZYkSiUm2Qq05WYzeQhLlOl3BeiLI8//mTZc/Fci0La+MJci8HB\n", "87nvvuJ09NOl7B8ZGeHQoedoa0umq9/Giy+eB6xjbGxnyX1Div/X8sgj32Js7KLE1S7hhBNK570A\n", "7N//T0xM7KaQ/n4du3ffpHkiQlRDrerUyhvzaOTSaFNJVphxPXwUxaORQW9rW55IVz/9yKAQLhzm\n", "vLS3l3fSVzOqaoZZTIh6g8xiEhf35jh5Z9JpV3KtZPvjdC7lsgRX67eZTnirfY6NdOgL0QgkLhIX\n", "d29eeGpeHehUmZcLUWGFKK7u7pPrJqYSBSHyERf5XETV1HeN92OAASYmwOxC3PcA1wLwxBPv48or\n", "L2L//r1AsU+nGkZGRoqWFb7rrs/V2ngh5j0SlzlANU7yViLd/rCm/CcmP1+8eBkvvvhHxI728XHY\n", "v39vVSKQFhKI17v/EAD33TegZI9C5IDEZQ4wVXTWbCAd5fXww68wPv4ssIeOjp28+tU9PPBA7fcZ\n", "GRkpEZKenp5EVmQYG0MRYkLkgMRljlBfE1X9Sba/eHQRRmBBFELZakdmxen1g5A8/vgHam+8EKIE\n", "iYtoObKEsl4jsxNOWM3Y2M5Za1IUolWxEBgwNzEzn8vfT8yMtFmso2Mnd9wRhCTph5nNI0Ah8sDM\n", "cHer6RpzufOVuIg0aYe+hESIUiQu0yBxERITIWaOxGUaJC7zm3JmMAmMEFOTh7i05dUYIVqN4uiw\n", "IDLxKAaC+GzatJVNm7YyMjLStHYKMRdRtJiYl2TNedGoRoj8kLiIOctUmQuy5rxo8qQQ+SFxEXOW\n", "2Z65QIjZjBz6Yl4iZ78Q5VG02DRIXMRUKExZiGwkLtMgcRFCiJmjUGQhhBAticRFCCFE7khchBBC\n", "5E5TxMXMusxsn5l928zuMrOlZcrdambPmdnBauoLIYRoDs0auVwK7HP3k4B7ouMsbgPOrKG+EEKI\n", "JtAscdkCxCsy7QHenlXI3b8E/LDa+kK0GspnJuYLzZqhv8rdn4v2nwNWNbi+EA1H+czEfKJu4mJm\n", "+4DVGR9dnjxwdzezqiejTFd/165dk/v9/f309/dXeyshakL5zESrMjo6yujoaK7XrJu4uPvGcp9F\n", "TvrV7v6sma0Bvj/Dy1dcPykuQgghSkm/eP/hH/5hzddsls9lL/HrW/j5+QbXF6LhDA6eT0fHToKb\n", "cE+Upfn8ZjdLiLrQlPQvZtYFfBY4HngMeIe7/8jMjgFudvdficp9CugDlhNGJ1e6+23l6mfcR+lf\n", "REuhfGZiNqDcYtMgcRFCiJmj3GJCCCFaEomLEEKI3JG4CCGEyB2JixBCiNyRuAghhMgdiYsQQojc\n", "kbgIIYTIHYmLEEKI3JG4CCGEyB2JixBCiNyRuAghhMgdiYsQQojckbgIIYTIHYmLEEKI3JG4CCGE\n", "yB2JixBCiNyRuAghhMgdiYsQQojckbgIIYTIHYmLEEKI3JG4CCGEyB2JixBCiNyRuAghhMidpoiL\n", "mXWZ2T4z+7aZ3WVmS8uUu9XMnjOzg6nzu8zsSTN7INrObEzLhRBCVEKzRi6XAvvc/STgnug4i9uA\n", "LOFw4Dp374224Tq1UwghRBU0S1y2AHui/T3A27MKufuXgB+WuYbVoV1CCCFyoFnissrdn4v2nwNW\n", "VXGNC83sgJndUs6sJoQQojksrNeFzWwfsDrjo8uTB+7uZuYzvPyfAX8U7X8A2A28N6vgrl27Jvf7\n", "+/vp7++f4a2EyJeRkRF2774JgMHB89m8eXOTWyTmO6Ojo4yOjuZ6TXOfab+ew03NHgH63f1ZM1sD\n", "3OvuPWXKngjc6e7rZvq5mXkzvp8Q5RgZGeGsswYYG/sQAB0dO7njjj0SGNFSmBnuXpProVlmsb3A\n", "QLQ/AHx+JpUjQYo5CzhYrqwQrcTu3TdFwjIABJGJRzFCzCWaJS7XABvN7NvAL0XHmNkxZvaFuJCZ\n", "fQr4O+AkM3vCzN4dffQhM3vIzA4AfcD2xjZfCCHEVDTFLNYoZBYTrYbMYmI2kIdZTOIiRIORQ1+0\n", "OhKXaZC4CCHEzJnNDn0hhBBzGImLEEKI3JG4CCGEyB2JixBCiNyRuAghhMgdiYsQQojckbgIIYTI\n", "HYmLEEKI3JG4CCGEyB2JixBCiNyRuAghhMgdiYsQQojckbgIIYTIHYmLEEKI3JG4CCGEyB2JixBC\n", "iNyRuAghhMgdiYsQQojckbgIIYTIHYmLEEKI3GmKuJhZl5ntM7Nvm9ldZrY0o8xaM7vXzB42s6+b\n", "2baZ1BdCCNE8mjVyuRTY5+4nAfdEx2leBra7++uB9cDvmlnPDOq3JKOjo81uQgmt2CZozXapTZWh\n", "NlVOq7arVpolLluAPdH+HuDt6QLu/qy7PxjtvwR8Ezi20vqtSiv+IbVim6A126U2VYbaVDmt2q5a\n", "aZa4rHL356L954BVUxU2sxOBXuCr1dQXQgjRWBbW68Jmtg9YnfHR5ckDd3cz8ymusxj4K+CiaART\n", "xHT1hRBCNB5zb3y/bGaPAP3u/qyZrQHudfeejHKHAX8N/I27f6SK+hIdIYSoAne3WurXbeQyDXuB\n", "AeBD0c/PpwuYmQG3AN9ICkul9aH2hyOEEKI6mjVy6QI+CxwPPAa8w91/ZGbHADe7+6+Y2ZuBvwUe\n", "AuJGXubuw+XqN/hrCCGEKENTxEUIIcTcZtbP0G/FCZmVXtPMbjWz58zsYOr8LjN70sweiLYzW6BN\n", "zXxOZ5rZI2b2HTPbmTif23Mqd49UmRuizw+YWe9M6japXY+Z2UPRs/lao9pkZj1m9hUz+4mZDc70\n", "+zShTc16Tr8e/c4eMrMvm9nPVlq3SW2a2XNy91m9AX8MvD/a3wlck1FmNXBatL8Y+BbQU2n9erQp\n", "+uw/E0KsD6bOXwXsaPRzmqZNTXlOwALgUeBE4DDgQeB1eT6nqe6RKPNW4IvR/i8Af19p3Wa0Kzr+\n", "HtCV899RJW1aCbwBGAIGZ1K30W1q8nN6E7Ak2j+z3n9TtbSpmuc060cutOaEzIqu6e5fAn5Y5hp5\n", "ByPU2qZmPac3Ao+6+2Pu/jLwaeBtic/zeE7T3aOore7+VWCpma2usG6j25Wc95X339G0bXL3H7j7\n", "/YQsGzOq24Q2xTTjOX3F3f81OvwqcFyldZvQppiKn9NcEJdWnJCZxzUvjIant+RhgsqhTc16TscC\n", "TySOn6TwYgD5PKfp7jFVmWMqqFsttbQLQiDM3WZ2v5md18A21aNuPa/bCs/pvcAXq6zbiDbBDJ9T\n", "s0KRZ4S14ITMvNpUhj8D/ija/wCwm/CLbmabqqqfQ5umuk9Vz2mG90jS6ND2Wtv1Znd/2sxWAvvM\n", "7JFoZNqINuVdt57X3eDuzzTrOZnZLwLvATbMtO4MqaVNMMPnNCvExd03lvvMgvN5tRcmVH6/TLnD\n", "gM8Bn3D35LyYiurXo01TXHuyvJn9OXBns9tE857TU8DaxPFawhtX1c9pJveYosxxUZnDKqhbLdW2\n", "6ykAd386+vkDM7uDYBaptdOspE31qFu367r7M9HPhj+nyGF+M3Cmu/9wJnUb3KYZP6e5YBaLJ1RC\n", "bRMyy9avR5umIupoY84CDpYr26g25VC/2mveD7zGzE40s3bgnKhens+p7D1Sbf0f0X3XAz+KTHqV\n", "1K2WqttlZovMrDM6fySwiXz+jmbyfdMjqno9q6rb1MznZGbHA/8XeJe7P1rl92lIm6p6TrVGIDR7\n", "A7qAu4FvA3cBS6PzxwBfiPbfDEwQoiMeiLYzp6pf7zZFx58Cngb+g2ALfXd0/i8Ik0cPEDrcVS3Q\n", "pmY+p7cQIvweJUykjc/n9pyy7gFcAFyQKPOx6PMDwOnTtS+nv++q2gW8Kvp7fxD4ep7tmq5NBDPo\n", "E8C/EoJD/gVYXM9nVW2bmvyc/hx4nkKf9LV6/01V26ZqnpMmUQohhMiduWAWE0II0WJIXIQQQuSO\n", "xEUIIUTuSFyEEELkjsRFCCFE7khchBBC5I7ERYgpMLMJM/t44nihmf3AzO6Mjv9bHinRzazNQur8\n", "g1Fa869FefCmqvPfzewbZnaPmfWZ2ZtqbYcQeTEr0r8I0UT+HXi9mR3h7j8BNhJSZjiAu99JhWln\n", "okwRePbksnOANe6+Lip7DPDjaS75XuC33P3vzGwX8CLwlUraIkS90chFiOn5IvAr0f65hCwGBmBm\n", "v2lmH432V5nZHWb2YLStj1JtfMvM9hDSZaw1sw8nRijviK67GngmvqG7P+3R0t1mdm5U9qCZXROd\n", "u5KQVPBWM/ssYZb1dgsLOb253g9EiOmQuAgxPZ8B3mlmhwPrKCzXkOYG4F53Pw04HfhGdP7VwJ+6\n", "+ynAzwOnAj8LnAF82ML6K58F/lskDtea2WkwOYK5BvhF4DTg583sbe7+R4RcUb/m7u8AbgSuc/de\n", "d78v7wcgxEyRuAgxDe5+kLB637nAF6Yo+ouEZQBw9wl3/7fo/OPuHi8LuwH4pAe+D+wH3ujuTwGv\n", "BS4j5MG7x8x+iSBGo+7+vLv/FPhL4L+UuX+jlwQQoizyuQhRGXuBa4E+wpK55cjq4P99mjKx/2Yc\n", "GAaGzew5wsqcd2fUVUJA0fJo5CJEZdwK7HL3h6cocw/wOwBmtsDMjsoo8yXgnCg6bCVhFPI1M+uN\n", "TGCYWRvBdPYY8DWgz8yWm9kC4J2E0U6aF4HO6r6aEPkjcRFiauJRxVPu/rHEOc/Yvwj4RTN7iOAP\n", "eV3yGtF17qCwTMA9wPsi89jRwF4zOxh9Ng58zN2fBS4F7iWkO78/ilBLcydwVuSz2ZDxuRANRSn3\n", "hRBC5I5GLkIIIXJH4iKEECJ3JC5CCCFyR+IihBAidyQuQgghckfiIoQQInckLkIIIXJH4iKEECJ3\n", "/j9j16Co5MuwagAAAABJRU5ErkJggg==\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# MS와 IBM의 산포도\n", "fig = plt.figure()\n", "ax = fig.add_subplot(1, 1, 1)\n", "ax.scatter(returns.MSFT, returns.IBM)\n", "ax.set_xlabel('MicroSoft')\n", "ax.set_ylabel('IBM')" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.49608477623932762" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# MicroSoft와 IBM의 주식 퍼센트 변화율의 상관관계 구하기\n", "returns.MSFT.corr(returns.IBM)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.00021609902459138997" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 공분산\n", "returns.MSFT.cov(returns.IBM)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AAPLIBMMSFT
AAPL 1.000000 0.409605 0.423189
IBM 0.409605 1.000000 0.496085
MSFT 0.423189 0.496085 1.000000
\n", "
" ], "text/plain": [ " AAPL IBM MSFT\n", "AAPL 1.000000 0.409605 0.423189\n", "IBM 0.409605 1.000000 0.496085\n", "MSFT 0.423189 0.496085 1.000000" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "returns.corr()" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AAPLIBMMSFT
AAPL 0.001032 0.000252 0.000309
IBM 0.000252 0.000367 0.000216
MSFT 0.000309 0.000216 0.000517
\n", "
" ], "text/plain": [ " AAPL IBM MSFT\n", "AAPL 0.001032 0.000252 0.000309\n", "IBM 0.000252 0.000367 0.000216\n", "MSFT 0.000309 0.000216 0.000517" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "returns.cov()" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "AAPL 0.409605\n", "IBM 1.000000\n", "MSFT 0.496085\n", "dtype: float64" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "returns.corrwith(returns.IBM)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AAPLIBMMSFT
Date
2000-01-03 133949200 10347700 53228400
2000-01-04 128094400 8227800 54119000
2000-01-05 194580400 12733200 64059600
2000-01-06 191993200 7971900 54976600
2000-01-07 115183600 11856700 62013600
\n", "
" ], "text/plain": [ " AAPL IBM MSFT\n", "Date \n", "2000-01-03 133949200 10347700 53228400\n", "2000-01-04 128094400 8227800 54119000\n", "2000-01-05 194580400 12733200 64059600\n", "2000-01-06 191993200 7971900 54976600\n", "2000-01-07 115183600 11856700 62013600" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "volume.head()" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "AAPL -0.057955\n", "IBM -0.007892\n", "MSFT -0.014295\n", "dtype: float64" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 시가총액과 주식 퍼센트 변화율간의 상관관계\n", "# DataFrame을 넘기면 맞아떨어지는 이름의 칼럼의 이름에 대한 상관관계를 계산\n", "returns.corrwith(volume)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 유일 값, 카운트, 멤버십" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 c\n", "1 a\n", "2 d\n", "3 a\n", "4 a\n", "5 b\n", "6 b\n", "7 c\n", "8 c\n", "dtype: object" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 유일 값\n", "obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])\n", "obj" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array(['c', 'a', 'd', 'b'], dtype=object)" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "uniques = obj.unique()\n", "uniques" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "c 3\n", "a 3\n", "b 2\n", "d 1\n", "dtype: int64" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 카운트\n", "obj.value_counts()" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "c 3\n", "a 3\n", "b 2\n", "d 1\n", "dtype: int64" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.value_counts(obj.values, sort=True)" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 c\n", "1 a\n", "2 d\n", "3 a\n", "4 a\n", "5 b\n", "6 b\n", "7 c\n", "8 c\n", "dtype: object" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 멤버십\n", "obj" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 True\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", "5 True\n", "6 True\n", "7 True\n", "8 True\n", "dtype: bool" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mask = obj.isin(['b', 'c'])\n", "mask" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 누락값 처리" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 누락값 찾기 : isnull\n", "* 누락값 제거 : dropna\n", "* 누락값 채우기 : fillna" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 누락값 찾기 : isnull" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* pandas는 누락된 데이터를 실수든 아니든 모두 NaN으로 취급" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 aardvark\n", "1 artichoke\n", "2 NaN\n", "3 avocado\n", "dtype: object" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])\n", "string_data" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 True\n", "3 False\n", "dtype: bool" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string_data.isnull()" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 None\n", "1 artichoke\n", "2 NaN\n", "3 avocado\n", "dtype: object" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string_data[0] = None\n", "string_data" ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 True\n", "1 False\n", "2 True\n", "3 False\n", "dtype: bool" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string_data.isnull()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 누락값 제거 : dropna" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 NaN\n", "2 3.5\n", "3 NaN\n", "4 7.0\n", "dtype: float64" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from numpy import nan as NA\n", "\n", "data = Series([1, NA, 3.5, NA, 7])\n", "data" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "2 3.5\n", "4 7.0\n", "dtype: float64" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.dropna()" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0 1 6.5 3
1 1 NaNNaN
2NaN NaNNaN
3NaN 9.0 2
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1 6.5 3\n", "1 1 NaN NaN\n", "2 NaN NaN NaN\n", "3 NaN 9.0 2" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = DataFrame([[1., 6.5, 3.], [1., NA, NA],\n", " [NA, NA, NA], [NA, 9, 2]])\n", "data" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0 1 6.5 3
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1 6.5 3" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cleaned = data.dropna()\n", "cleaned" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0 1 6.5 3
1 1 NaNNaN
3NaN 9.0 2
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1 6.5 3\n", "1 1 NaN NaN\n", "3 NaN 9.0 2" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 모든 값이 NA인 로우만 제외\n", "data.dropna(how='all')" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0 1 6.5 3
3NaN 9.0 2
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1 6.5 3\n", "3 NaN 9.0 2" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 몇개 이상 값이 들어있는가를 기준으로\n", "data.dropna(thresh=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 누락값 채우기 : fillna" ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-0.864298 NaN NaN
1-0.056151 NaN NaN
2 0.226936 NaN NaN
3 1.825479 NaN 0.528780
4-0.040804 NaN-0.996493
5 1.586317-0.004980-0.039019
6-0.311639-2.297522-0.975941
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -0.864298 NaN NaN\n", "1 -0.056151 NaN NaN\n", "2 0.226936 NaN NaN\n", "3 1.825479 NaN 0.528780\n", "4 -0.040804 NaN -0.996493\n", "5 1.586317 -0.004980 -0.039019\n", "6 -0.311639 -2.297522 -0.975941" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = DataFrame(np.random.randn(7, 3))\n", "df.ix[:4, 1] = NA; df.ix[:2, 2] = NA\n", "df" ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-0.864298 0.000000 0.000000
1-0.056151 0.000000 0.000000
2 0.226936 0.000000 0.000000
3 1.825479 0.000000 0.528780
4-0.040804 0.000000-0.996493
5 1.586317-0.004980-0.039019
6-0.311639-2.297522-0.975941
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -0.864298 0.000000 0.000000\n", "1 -0.056151 0.000000 0.000000\n", "2 0.226936 0.000000 0.000000\n", "3 1.825479 0.000000 0.528780\n", "4 -0.040804 0.000000 -0.996493\n", "5 1.586317 -0.004980 -0.039019\n", "6 -0.311639 -2.297522 -0.975941" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna(0)" ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-0.864298 0.500000 NaN
1-0.056151 0.500000 NaN
2 0.226936 0.500000 NaN
3 1.825479 0.500000 0.528780
4-0.040804 0.500000-0.996493
5 1.586317-0.004980-0.039019
6-0.311639-2.297522-0.975941
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -0.864298 0.500000 NaN\n", "1 -0.056151 0.500000 NaN\n", "2 0.226936 0.500000 NaN\n", "3 1.825479 0.500000 0.528780\n", "4 -0.040804 0.500000 -0.996493\n", "5 1.586317 -0.004980 -0.039019\n", "6 -0.311639 -2.297522 -0.975941" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# key는 컬럼 색인을 뜻함\n", "df.fillna({1: 0.5, 3: -1})" ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0 1.070430 0.929998 1.464021
1 0.834314 0.040356-0.028218
2-1.251520 NaN 0.568477
3-0.546254 NaN-0.464190
4-0.729633 NaN NaN
5-1.925968 NaN NaN
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.070430 0.929998 1.464021\n", "1 0.834314 0.040356 -0.028218\n", "2 -1.251520 NaN 0.568477\n", "3 -0.546254 NaN -0.464190\n", "4 -0.729633 NaN NaN\n", "5 -1.925968 NaN NaN" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 보간법 가능\n", "df = DataFrame(np.random.randn(6, 3))\n", "df.ix[2:, 1] = NA; df.ix[4:, 2] = NA\n", "df" ] }, { "cell_type": "code", "execution_count": 88, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0 1.070430 0.929998 1.464021
1 0.834314 0.040356-0.028218
2-1.251520 0.040356 0.568477
3-0.546254 0.040356-0.464190
4-0.729633 0.040356-0.464190
5-1.925968 0.040356-0.464190
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.070430 0.929998 1.464021\n", "1 0.834314 0.040356 -0.028218\n", "2 -1.251520 0.040356 0.568477\n", "3 -0.546254 0.040356 -0.464190\n", "4 -0.729633 0.040356 -0.464190\n", "5 -1.925968 0.040356 -0.464190" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna(method='ffill')" ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0 1.070430 0.929998 1.464021
1 0.834314 0.040356-0.028218
2-1.251520 0.485177 0.568477
3-0.546254 0.485177-0.464190
4-0.729633 0.485177 0.385023
5-1.925968 0.485177 0.385023
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.070430 0.929998 1.464021\n", "1 0.834314 0.040356 -0.028218\n", "2 -1.251520 0.485177 0.568477\n", "3 -0.546254 0.485177 -0.464190\n", "4 -0.729633 0.485177 0.385023\n", "5 -1.925968 0.485177 0.385023" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 함수도 넘길 수 있다.\n", "df.fillna(df.mean(0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 계층적 색인" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 1 0.299331\n", " 2 -2.320418\n", " 3 -1.333153\n", "b 1 0.757166\n", " 2 -1.473082\n", " 3 -0.300510\n", "c 1 0.986354\n", " 2 1.393783\n", "d 2 -1.309509\n", " 3 -0.417365\n", "dtype: float64" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = Series(np.random.randn(10),\n", " index = [['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],\n", " [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])\n", "data" ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]],\n", " labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.index" ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1 0.757166\n", "2 -1.473082\n", "3 -0.300510\n", "dtype: float64" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['b']" ] }, { "cell_type": "code", "execution_count": 95, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "b 1 0.757166\n", " 2 -1.473082\n", " 3 -0.300510\n", "c 1 0.986354\n", " 2 1.393783\n", "dtype: float64" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['b':'c']" ] }, { "cell_type": "code", "execution_count": 96, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
123
a 0.299331-2.320418-1.333153
b 0.757166-1.473082-0.300510
c 0.986354 1.393783 NaN
d NaN-1.309509-0.417365
\n", "
" ], "text/plain": [ " 1 2 3\n", "a 0.299331 -2.320418 -1.333153\n", "b 0.757166 -1.473082 -0.300510\n", "c 0.986354 1.393783 NaN\n", "d NaN -1.309509 -0.417365" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 계층 색인 풀기 - wide 형\n", "data.unstack()" ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 1 0.299331\n", " 2 -2.320418\n", " 3 -1.333153\n", "b 1 0.757166\n", " 2 -1.473082\n", " 3 -0.300510\n", "c 1 0.986354\n", " 2 1.393783\n", "d 2 -1.309509\n", " 3 -0.417365\n", "dtype: float64" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 데이터프레임 로우,칼럼 색인을 계층 색인으로 묶어내기 - long 형\n", "data.unstack().stack()" ] }, { "cell_type": "code", "execution_count": 98, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OhioColorado
GreenRedGreen
a1 0 1 2
2 3 4 5
b1 6 7 8
2 9 10 11
\n", "
" ], "text/plain": [ " Ohio Colorado\n", " Green Red Green\n", "a 1 0 1 2\n", " 2 3 4 5\n", "b 1 6 7 8\n", " 2 9 10 11" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame = DataFrame(np.arange(12).reshape((4, 3)),\n", " index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],\n", " columns=[['Ohio', 'Ohio', 'Colorado'],\n", " ['Green', 'Red', 'Green']])\n", "frame" ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OhioColorado
GreenRedGreen
key1key2
a1 0 1 2
2 3 4 5
b1 6 7 8
2 9 10 11
\n", "
" ], "text/plain": [ " Ohio Colorado\n", " Green Red Green\n", "key1 key2 \n", "a 1 0 1 2\n", " 2 3 4 5\n", "b 1 6 7 8\n", " 2 9 10 11" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 계층 색인에 이름 주기\n", "frame.index.names = ['key1', 'key2']\n", "frame" ] }, { "cell_type": "code", "execution_count": 101, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
stateOhioColorado
colorGreenRedGreen
key1key2
a1 0 1 2
2 3 4 5
b1 6 7 8
2 9 10 11
\n", "
" ], "text/plain": [ "state Ohio Colorado\n", "color Green Red Green\n", "key1 key2 \n", "a 1 0 1 2\n", " 2 3 4 5\n", "b 1 6 7 8\n", " 2 9 10 11" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.columns.names = ['state', 'color']\n", "frame" ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
colorGreenRed
key1key2
a1 0 1
2 3 4
b1 6 7
2 9 10
\n", "
" ], "text/plain": [ "color Green Red\n", "key1 key2 \n", "a 1 0 1\n", " 2 3 4\n", "b 1 6 7\n", " 2 9 10" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame['Ohio']" ] }, { "cell_type": "code", "execution_count": 105, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
colorGreenRed
key2
1 0 1
2 3 4
\n", "
" ], "text/plain": [ "color Green Red\n", "key2 \n", "1 0 1\n", "2 3 4" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame['Ohio'].ix['a']" ] }, { "cell_type": "code", "execution_count": 106, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
stateOhioColorado
colorGreenRedGreen
key2
1 6 8 10
2 12 14 16
\n", "
" ], "text/plain": [ "state Ohio Colorado\n", "color Green Red Green\n", "key2 \n", "1 6 8 10\n", "2 12 14 16" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.sum(level='key2')" ] }, { "cell_type": "code", "execution_count": 107, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
colorGreenRed
key1key2
a1 2 1
2 8 4
b1 14 7
2 20 10
\n", "
" ], "text/plain": [ "color Green Red\n", "key1 key2 \n", "a 1 2 1\n", " 2 8 4\n", "b 1 14 7\n", " 2 20 10" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.sum(level='color', axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 로우 색인과 컬럼 색인 교환" ] }, { "cell_type": "code", "execution_count": 108, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcd
0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
3 3 4 two 0
4 4 3 two 1
5 5 2 two 2
6 6 1 two 3
\n", "
" ], "text/plain": [ " a b c d\n", "0 0 7 one 0\n", "1 1 6 one 1\n", "2 2 5 one 2\n", "3 3 4 two 0\n", "4 4 3 two 1\n", "5 5 2 two 2\n", "6 6 1 two 3" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame = DataFrame({'a': range(7),\n", " 'b': range(7, 0, -1),\n", " 'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],\n", " 'd': [0, 1, 2, 0, 1, 2, 3]})\n", "frame" ] }, { "cell_type": "code", "execution_count": 110, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
cd
one0 0 7
1 1 6
2 2 5
two0 3 4
1 4 3
2 5 2
3 6 1
\n", "
" ], "text/plain": [ " a b\n", "c d \n", "one 0 0 7\n", " 1 1 6\n", " 2 2 5\n", "two 0 3 4\n", " 1 4 3\n", " 2 5 2\n", " 3 6 1" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#컬럼 색인을 로우 색인으로 옮길 수 있다.\n", "frame2 = frame.set_index(['c', 'd'])\n", "frame2" ] }, { "cell_type": "code", "execution_count": 113, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cdab
0 one 0 0 7
1 one 1 1 6
2 one 2 2 5
3 two 0 3 4
4 two 1 4 3
5 two 2 5 2
6 two 3 6 1
\n", "
" ], "text/plain": [ " c d a b\n", "0 one 0 0 7\n", "1 one 1 1 6\n", "2 one 2 2 5\n", "3 two 0 3 4\n", "4 two 1 4 3\n", "5 two 2 5 2\n", "6 two 3 6 1" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#로우 색인 -> 컬럼 색인으로\n", "frame2.reset_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 데이터 입출력" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 파일 읽기\n", "* 파일 쓰기\n", "* 웹 내용 긁어오기\n", "* 데이터베이스 사용" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 파일 읽기" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "함수 | 설명 |\n", "---------- | ----------|\n", "read_csv | 파일, URL 또는 파일과 유사한 객체로부터 구분된 데이터 읽어옴. 구분자는 쉼표(,)를 기본으로.\n", "read_table | read_csv와 유사. 구분자는 탭('\\t')을 기본으로.\n", "read_fwf | 고정폭 컬럼 형식에서 읽어옴(구분자 없는 경우)\n", "read_clipboard | 클립보드에 있는 데이터를 읽어오는 read_table 함수. 웹페이지에서 표를 긁어올 때 유용" ] }, { "cell_type": "code", "execution_count": 152, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from pandas import DataFrame, Series\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 157, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a,b,c,d,message\r\n", "1,2,3,4,hello\r\n", "5,6,7,8,world\r\n", "9,10,11,12,foo\r\n" ] } ], "source": [ "!cat ex1.csv" ] }, { "cell_type": "code", "execution_count": 159, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcdmessage
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
\n", "
" ], "text/plain": [ " a b c d message\n", "0 1 2 3 4 hello\n", "1 5 6 7 8 world\n", "2 9 10 11 12 foo" ] }, "execution_count": 159, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('ex1.csv')\n", "df" ] }, { "cell_type": "code", "execution_count": 160, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcdmessage
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
\n", "
" ], "text/plain": [ " a b c d message\n", "0 1 2 3 4 hello\n", "1 5 6 7 8 world\n", "2 9 10 11 12 foo" ] }, "execution_count": 160, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_table('ex1.csv', sep=',')" ] }, { "cell_type": "code", "execution_count": 161, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01234
0 a b c d message
1 1 2 3 4 hello
2 5 6 7 8 world
3 9 10 11 12 foo
\n", "
" ], "text/plain": [ " 0 1 2 3 4\n", "0 a b c d message\n", "1 1 2 3 4 hello\n", "2 5 6 7 8 world\n", "3 9 10 11 12 foo" ] }, "execution_count": 161, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv('ex1.csv', header=None)" ] }, { "cell_type": "code", "execution_count": 164, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
defmessage2
a b c d message
1 2 3 4 hello
5 6 7 8 world
9 10 11 12 foo
\n", "
" ], "text/plain": [ " d e f message2\n", "a b c d message\n", "1 2 3 4 hello\n", "5 6 7 8 world\n", "9 10 11 12 foo" ] }, "execution_count": 164, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv('ex1.csv', header=None, names=['d', 'e', 'f', 'message2'])" ] }, { "cell_type": "code", "execution_count": 165, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcdmessage
0 1 2 3 4 hello
1 5 6 7 8 world
\n", "
" ], "text/plain": [ " a b c d message\n", "0 1 2 3 4 hello\n", "1 5 6 7 8 world" ] }, "execution_count": 165, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 몇 줄만 읽어오기\n", "pd.read_csv('ex1.csv', nrows=2)" ] }, { "cell_type": "code", "execution_count": 171, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 171, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 로우 단위로, 여러 조각으로 읽어오기\n", "chunker = pd.read_csv('ex6.csv', chunksize=1000)\n", "chunker" ] }, { "cell_type": "code", "execution_count": 170, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10001 03_data/file/ex6.csv\r\n" ] } ], "source": [ "!wc -l ex6.csv" ] }, { "cell_type": "code", "execution_count": 169, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "one,two,three,four,key\r\n", "0.467976300189,-0.0386485396255,-0.295344251987,-1.82472622729,L\r\n", "-0.358893469543,1.40445260007,0.704964644926,-0.200638304015,B\r\n", "-0.50184039929,0.659253707223,-0.421690619312,-0.0576883018364,G\r\n", "0.204886212202,1.07413396504,1.38836131252,-0.982404023494,R\r\n", "0.354627914484,-0.133115852296,0.283762637978,-0.837062961653,Q\r\n", "1.81748001608,0.742272722638,0.419394843928,-2.25103520513,Q\r\n", "-0.776764319165,0.935517747061,-0.332871759623,-1.87564085416,U\r\n", "-0.913134961617,1.53062351168,-0.572656719239,0.477252252981,K\r\n", "0.358479538224,-0.49757199147,-0.367016188009,0.507701778685,S\r\n" ] } ], "source": [ "!head ex6.csv" ] }, { "cell_type": "code", "execution_count": 172, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "E 368\n", "X 364\n", "L 346\n", "O 343\n", "Q 340\n", "M 338\n", "J 337\n", "F 335\n", "K 334\n", "H 330\n", "V 328\n", "I 327\n", "U 326\n", "P 324\n", "D 320\n", "A 320\n", "R 318\n", "Y 314\n", "G 308\n", "S 308\n", "N 306\n", "W 305\n", "T 304\n", "B 302\n", "Z 288\n", "C 286\n", "4 171\n", "6 166\n", "7 164\n", "8 162\n", "3 162\n", "5 157\n", "2 152\n", "0 151\n", "9 150\n", "1 146\n", "dtype: float64" ] }, "execution_count": 172, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# chunker로 쪼개진 파일을 순회하면서, key 칼럼에 있는 값 세어보기\n", "tot = Series([])\n", "for piece in chunker :\n", " tot = tot.add(piece['key'].value_counts(), fill_value=0)\n", " \n", "tot = tot.order(ascending=False)\n", "tot" ] }, { "cell_type": "code", "execution_count": 173, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefourkey
0 0.467976-0.038649-0.295344-1.824726 B
1-0.358893 1.404453 0.704965-0.200638 M
2-0.501840 0.659254-0.421691-0.057688 N
3 0.204886 1.074134 1.388361-0.982404 N
4 0.354628-0.133116 0.283763-0.837063 Y
\n", "
" ], "text/plain": [ " one two three four key\n", "0 0.467976 -0.038649 -0.295344 -1.824726 B\n", "1 -0.358893 1.404453 0.704965 -0.200638 M\n", "2 -0.501840 0.659254 -0.421691 -0.057688 N\n", "3 0.204886 1.074134 1.388361 -0.982404 N\n", "4 0.354628 -0.133116 0.283763 -0.837063 Y" ] }, "execution_count": 173, "metadata": {}, "output_type": "execute_result" } ], "source": [ "piece.head()" ] }, { "cell_type": "code", "execution_count": 176, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "K 42\n", "M 41\n", "U 39\n", "Y 38\n", "P 38\n", "E 35\n", "R 34\n", "Q 34\n", "I 33\n", "B 33\n", "T 33\n", "F 33\n", "H 33\n", "A 32\n", "G 31\n", "Z 31\n", "L 30\n", "J 29\n", "N 29\n", "S 29\n", "X 28\n", "W 28\n", "C 28\n", "3 26\n", "D 25\n", "V 23\n", "6 22\n", "4 21\n", "O 19\n", "0 17\n", "2 15\n", "8 15\n", "5 15\n", "9 15\n", "7 14\n", "1 12\n", "dtype: int64" ] }, "execution_count": 176, "metadata": {}, "output_type": "execute_result" } ], "source": [ "piece['key'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 파일 쓰기" ] }, { "cell_type": "code", "execution_count": 178, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcdmessage
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
\n", "
" ], "text/plain": [ " a b c d message\n", "0 1 2 3 4 hello\n", "1 5 6 7 8 world\n", "2 9 10 11 12 foo" ] }, "execution_count": 178, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.read_csv('ex1.csv')\n", "data" ] }, { "cell_type": "code", "execution_count": 179, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# 쓰기\n", "data.to_csv('out.csv')" ] }, { "cell_type": "code", "execution_count": 180, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ",a,b,c,d,message\r\n", "0,1,2,3,4,hello\r\n", "1,5,6,7,8,world\r\n", "2,9,10,11,12,foo\r\n" ] } ], "source": [ "!cat out.csv" ] }, { "cell_type": "code", "execution_count": 181, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# 다른 구분자 사용\n", "data.to_csv('out2.csv', sep='|')" ] }, { "cell_type": "code", "execution_count": 183, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "|a|b|c|d|message\r\n", "0|1|2|3|4|hello\r\n", "1|5|6|7|8|world\r\n", "2|9|10|11|12|foo\r\n" ] } ], "source": [ "!cat out2.csv" ] }, { "cell_type": "code", "execution_count": 185, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# 컬럼의 일부분반 저장\n", "data.to_csv('out3.csv', columns=['a', 'b', 'd'])" ] }, { "cell_type": "code", "execution_count": 186, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ",a,b,d\r\n", "0,1,2,4\r\n", "1,5,6,8\r\n", "2,9,10,12\r\n" ] } ], "source": [ "!cat out3.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 웹 내용 긁어오기" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* json 데이터 다루기\n", "* 웹 API" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### json 데이터 다루기" ] }, { "cell_type": "code", "execution_count": 191, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# JSON 데이터\n", "obj = \"\"\"\n", "{\n", " \"name\": \"Wes\",\n", " \"places_lived\": [\"United States\", \"Spain\", \"Germany\"],\n", " \"pet\": null, \"siblings\": [{\"name\": \"Scott\", \"age\":25, \"pet\":\"Zuko\"},\n", " {\"name\": \"Katie\", \"age\":33, \"pet\": \"Cisco\"}]\n", "}\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 192, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{u'name': u'Wes',\n", " u'pet': None,\n", " u'places_lived': [u'United States', u'Spain', u'Germany'],\n", " u'siblings': [{u'age': 25, u'name': u'Scott', u'pet': u'Zuko'},\n", " {u'age': 33, u'name': u'Katie', u'pet': u'Cisco'}]}" ] }, "execution_count": 192, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import json\n", "result = json.loads(obj)\n", "result" ] }, { "cell_type": "code", "execution_count": 195, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameage
0 Scott 25
1 Katie 33
\n", "
" ], "text/plain": [ " name age\n", "0 Scott 25\n", "1 Katie 33" ] }, "execution_count": 195, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# JSON 객체의 리스트를 DataFrame 생성자로 넘기고 데이터 필드 선택 가능\n", "siblings = DataFrame(result['siblings'], columns=['name', 'age'])\n", "siblings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 웹 API" ] }, { "cell_type": "code", "execution_count": 209, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 209, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import requests\n", "\n", "url = 'http://www.burgerking.co.kr/api/store/searchmap/empty/?areacd='\n", "resp = requests.get(url)\n", "resp" ] }, { "cell_type": "code", "execution_count": 233, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import json\n", "data = json.loads(resp.text)" ] }, { "cell_type": "code", "execution_count": 237, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[{u'AllHour': u'N',\n", " u'CloseCleaning': u'',\n", " u'ClosePeakSeason': u'',\n", " u'CloseWeekday': u'22:00',\n", " u'Delivery': u'N',\n", " u'DriveThrough': u'N',\n", " u'Morning': u'N',\n", " u'NewAddr': u'\\uc11c\\uc6b8\\ud2b9\\ubcc4\\uc2dc \\uae08\\ucc9c\\uad6c \\uac00\\uc0b0\\ub514\\uc9c0\\ud1381\\ub85c ',\n", " u'NewAddr2': u'168 \\uc6b0\\ub9bc\\ub77c\\uc774\\uc628\\uc2a4\\ubc38\\ub9ac A\\ub3d9',\n", " u'OpenTime': u'9:00',\n", " u'PhoneNumber': u'02-853-0332',\n", " u'PointX': 37.4799652,\n", " u'PointY': 126.882637,\n", " u'StoreNM': u'\\uac00\\uc0b0\\ub514\\uc9c0\\ud138\\uc810',\n", " u'StoreSQ': 1},\n", " {u'AllHour': u'N',\n", " u'CloseCleaning': u'',\n", " u'ClosePeakSeason': u'/ \\uae08, \\ud1a0: 11:00~24:00',\n", " u'CloseWeekday': u'22:30',\n", " u'Delivery': u'N',\n", " u'DriveThrough': u'N',\n", " u'Morning': u'N',\n", " u'NewAddr': u'\\uc11c\\uc6b8\\ud2b9\\ubcc4\\uc2dc \\uae08\\ucc9c\\uad6c \\ub514\\uc9c0\\ud138\\ub85c ',\n", " u'NewAddr2': u'10\\uae38 9 \\ud604\\ub300\\uc544\\uc6b8\\ub81b 6\\uce35',\n", " u'OpenTime': u'11:00',\n", " u'PhoneNumber': u'02-2136-9962',\n", " u'PointX': 37.47762,\n", " u'PointY': 126.889053,\n", " u'StoreNM': u'\\uac00\\uc0b0\\ud604\\ub300\\uc544\\uc6b8\\ub81b\\uc810',\n", " u'StoreSQ': 2},\n", " {u'AllHour': u'N',\n", " u'CloseCleaning': u'',\n", " u'ClosePeakSeason': u'',\n", " u'CloseWeekday': u'23:30',\n", " u'Delivery': u'N',\n", " u'DriveThrough': u'N',\n", " u'Morning': u'N',\n", " u'NewAddr': u'\\uc11c\\uc6b8\\ud2b9\\ubcc4\\uc2dc \\uac15\\uc11c\\uad6c \\uc591\\ucc9c\\ub85c',\n", " u'NewAddr2': u'559 \\uc774\\ub9c8\\ud2b83\\uce35 (\\uac00\\uc591\\ub3d9)',\n", " u'OpenTime': u'10:00',\n", " u'PhoneNumber': u'02-3664-0221',\n", " u'PointX': 37.5581932,\n", " u'PointY': 126.861816,\n", " u'StoreNM': u'\\uac00\\uc591\\uc774\\ub9c8\\ud2b8\\uc810',\n", " u'StoreSQ': 3}]" ] }, "execution_count": 237, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[:3]" ] }, { "cell_type": "code", "execution_count": 238, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'Delivery',\n", " u'OpenTime',\n", " u'ClosePeakSeason',\n", " u'NewAddr',\n", " u'CloseWeekday',\n", " u'StoreNM',\n", " u'DriveThrough',\n", " u'Morning',\n", " u'StoreSQ',\n", " u'AllHour',\n", " u'PhoneNumber',\n", " u'NewAddr2',\n", " u'PointX',\n", " u'PointY',\n", " u'CloseCleaning']" ] }, "execution_count": 238, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[0].keys()" ] }, { "cell_type": "code", "execution_count": 229, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fields = ['StoreNM', 'Delivery', 'OpenTime', 'PointX', 'PointY']\n", "burger_store = DataFrame(data, columns=fields)" ] }, { "cell_type": "code", "execution_count": 230, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StoreNMDeliveryOpenTimePointXPointY
0 가산디지털점 N 9:00 37.479965 126.882637
1 가산현대아울렛점 N 11:00 37.477620 126.889053
2 가양이마트점 N 10:00 37.558193 126.861816
3 강남NC점 N 37.509400 127.007309
4 강남교보점 N 37.504570 127.023590
\n", "
" ], "text/plain": [ " StoreNM Delivery OpenTime PointX PointY\n", "0 가산디지털점 N 9:00 37.479965 126.882637\n", "1 가산현대아울렛점 N 11:00 37.477620 126.889053\n", "2 가양이마트점 N 10:00 37.558193 126.861816\n", "3 강남NC점 N 37.509400 127.007309\n", "4 강남교보점 N 37.504570 127.023590" ] }, "execution_count": 230, "metadata": {}, "output_type": "execute_result" } ], "source": [ "burger_store.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 데이터 다듬기" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 데이터 합치기 \n", "* 재형성과 피벗\n", "* 데이터 변형\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 데이터 합치기" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* merge\n", "* concat\n", "* combine_first" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### merge" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 하나 이상의 키를 기준으로 DataFrame의 로우를 합친다. SQL이나 다른 관계형 데이터베이스의 join 연산과 비슷하다.\n", "* 기본적으로 내부조인inner join을 수행하여 교집합을 반환한다.\n", "* how 인자로 왼쪽 우선 외부조인(left), 오른쪽 우선 외부조인(right), 완전 외부조인(outer)를 수행할 수 있다. \n", "* 색인도 merge key가 될 수 있다." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 내부조인inner join" ] }, { "cell_type": "code", "execution_count": 259, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1key
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 a
6 6 b
\n", "
" ], "text/plain": [ " data1 key\n", "0 0 b\n", "1 1 b\n", "2 2 a\n", "3 3 c\n", "4 4 a\n", "5 5 a\n", "6 6 b" ] }, "execution_count": 259, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = DataFrame({'key' : ['b','b','a','c','a','a','b'],\n", " 'data1': range(7)})\n", "df1" ] }, { "cell_type": "code", "execution_count": 260, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data2key
0 0 a
1 1 b
2 2 d
\n", "
" ], "text/plain": [ " data2 key\n", "0 0 a\n", "1 1 b\n", "2 2 d" ] }, "execution_count": 260, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = DataFrame({'key' : ['a','b','d'],\n", " 'data2': range(3)})\n", "df2" ] }, { "cell_type": "code", "execution_count": 252, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1keydata2
0 0 b 1
1 1 b 1
2 6 b 1
3 2 a 0
4 4 a 0
5 5 a 0
\n", "
" ], "text/plain": [ " data1 key data2\n", "0 0 b 1\n", "1 1 b 1\n", "2 6 b 1\n", "3 2 a 0\n", "4 4 a 0\n", "5 5 a 0" ] }, "execution_count": 252, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 암묵적으로 겹치는 컬럼을 키로 사용해서 합침\n", "pd.merge(df1,df2)" ] }, { "cell_type": "code", "execution_count": 253, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1keydata2
0 0 b 1
1 1 b 1
2 6 b 1
3 2 a 0
4 4 a 0
5 5 a 0
\n", "
" ], "text/plain": [ " data1 key data2\n", "0 0 b 1\n", "1 1 b 1\n", "2 6 b 1\n", "3 2 a 0\n", "4 4 a 0\n", "5 5 a 0" ] }, "execution_count": 253, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 명시적으로 키를 설정할 수 있다.\n", "pd.merge(df1,df2, on='key')" ] }, { "cell_type": "code", "execution_count": 256, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1lkey
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 a
6 6 b
\n", "
" ], "text/plain": [ " data1 lkey\n", "0 0 b\n", "1 1 b\n", "2 2 a\n", "3 3 c\n", "4 4 a\n", "5 5 a\n", "6 6 b" ] }, "execution_count": 256, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 공통되는 칼럼이 없드면 정해줄수 있다.\n", "\n", "df3 = DataFrame({'lkey' : ['b','b','a','c','a','a','b'],\n", " 'data1': range(7)})\n", "df3" ] }, { "cell_type": "code", "execution_count": 257, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data2rkey
0 0 a
1 1 b
2 2 d
\n", "
" ], "text/plain": [ " data2 rkey\n", "0 0 a\n", "1 1 b\n", "2 2 d" ] }, "execution_count": 257, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df4 = DataFrame({'rkey' : ['a','b','d'],\n", " 'data2': range(3)})\n", "df4" ] }, { "cell_type": "code", "execution_count": 258, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1lkeydata2rkey
0 0 b 1 b
1 1 b 1 b
2 6 b 1 b
3 2 a 0 a
4 4 a 0 a
5 5 a 0 a
\n", "
" ], "text/plain": [ " data1 lkey data2 rkey\n", "0 0 b 1 b\n", "1 1 b 1 b\n", "2 6 b 1 b\n", "3 2 a 0 a\n", "4 4 a 0 a\n", "5 5 a 0 a" ] }, "execution_count": 258, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 따로 정해줌\n", "pd.merge(df3, df4, left_on='lkey', right_on='rkey')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 완전 외부조인(outer)" ] }, { "cell_type": "code", "execution_count": 261, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1key
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 a
6 6 b
\n", "
" ], "text/plain": [ " data1 key\n", "0 0 b\n", "1 1 b\n", "2 2 a\n", "3 3 c\n", "4 4 a\n", "5 5 a\n", "6 6 b" ] }, "execution_count": 261, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": 262, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data2key
0 0 a
1 1 b
2 2 d
\n", "
" ], "text/plain": [ " data2 key\n", "0 0 a\n", "1 1 b\n", "2 2 d" ] }, "execution_count": 262, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "code", "execution_count": 263, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1keydata2
0 0 b 1
1 1 b 1
2 6 b 1
3 2 a 0
4 4 a 0
5 5 a 0
6 3 cNaN
7NaN d 2
\n", "
" ], "text/plain": [ " data1 key data2\n", "0 0 b 1\n", "1 1 b 1\n", "2 6 b 1\n", "3 2 a 0\n", "4 4 a 0\n", "5 5 a 0\n", "6 3 c NaN\n", "7 NaN d 2" ] }, "execution_count": 263, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df1, df2, how='outer')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 왼쪽 우선 외부조인(left)" ] }, { "cell_type": "code", "execution_count": 264, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df1 = DataFrame({'key' : ['b','b','a','c','a','b'],\n", " 'data1': range(6)})\n", "df2 = DataFrame({'key' : ['a','b','a', 'b', 'd'],\n", " 'data2': range(5)})" ] }, { "cell_type": "code", "execution_count": 266, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1key
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 b
\n", "
" ], "text/plain": [ " data1 key\n", "0 0 b\n", "1 1 b\n", "2 2 a\n", "3 3 c\n", "4 4 a\n", "5 5 b" ] }, "execution_count": 266, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": 267, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data2key
0 0 a
1 1 b
2 2 a
3 3 b
4 4 d
\n", "
" ], "text/plain": [ " data2 key\n", "0 0 a\n", "1 1 b\n", "2 2 a\n", "3 3 b\n", "4 4 d" ] }, "execution_count": 267, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "code", "execution_count": 268, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1keydata2
0 0 b 1
1 0 b 3
2 1 b 1
3 1 b 3
4 2 a 0
5 2 a 2
6 3 cNaN
7 4 a 0
8 4 a 2
9 5 b 1
10 5 b 3
\n", "
" ], "text/plain": [ " data1 key data2\n", "0 0 b 1\n", "1 0 b 3\n", "2 1 b 1\n", "3 1 b 3\n", "4 2 a 0\n", "5 2 a 2\n", "6 3 c NaN\n", "7 4 a 0\n", "8 4 a 2\n", "9 5 b 1\n", "10 5 b 3" ] }, "execution_count": 268, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 왼쪽 우선 외부조인은 왼쪽의 모든 로우를 포함하는 결과를 반환한다\n", "pd.merge(df1, df2, on='key', how='left')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 오른쪽 우선 외부조인(right)" ] }, { "cell_type": "code", "execution_count": 269, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1keydata2
0 0 b 1
1 1 b 1
2 5 b 1
3 0 b 3
4 1 b 3
5 5 b 3
6 2 a 0
7 4 a 0
8 2 a 2
9 4 a 2
10NaN d 4
\n", "
" ], "text/plain": [ " data1 key data2\n", "0 0 b 1\n", "1 1 b 1\n", "2 5 b 1\n", "3 0 b 3\n", "4 1 b 3\n", "5 5 b 3\n", "6 2 a 0\n", "7 4 a 0\n", "8 2 a 2\n", "9 4 a 2\n", "10 NaN d 4" ] }, "execution_count": 269, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 오른쪽 우선 외부조인은 오른쪽의 모든 로우를 포함하는 결과를 반환한다\n", "pd.merge(df1, df2, on='key', how='right')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 색인 merge" ] }, { "cell_type": "code", "execution_count": 273, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keyvalue
0 a 0
1 b 1
2 a 2
3 a 3
4 b 4
5 c 5
\n", "
" ], "text/plain": [ " key value\n", "0 a 0\n", "1 b 1\n", "2 a 2\n", "3 a 3\n", "4 b 4\n", "5 c 5" ] }, "execution_count": 273, "metadata": {}, "output_type": "execute_result" } ], "source": [ "left1 = DataFrame({'key' : ['a', 'b', 'a', 'a', 'b', 'c'],\n", " 'value' : range(6)})\n", "left1" ] }, { "cell_type": "code", "execution_count": 274, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
group_val
a 3.5
b 7.0
\n", "
" ], "text/plain": [ " group_val\n", "a 3.5\n", "b 7.0" ] }, "execution_count": 274, "metadata": {}, "output_type": "execute_result" } ], "source": [ "right1 = DataFrame({'group_val' : [3.5, 7]}, index=['a', 'b'])\n", "right1" ] }, { "cell_type": "code", "execution_count": 275, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keyvaluegroup_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
\n", "
" ], "text/plain": [ " key value group_val\n", "0 a 0 3.5\n", "2 a 2 3.5\n", "3 a 3 3.5\n", "1 b 1 7.0\n", "4 b 4 7.0" ] }, "execution_count": 275, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 기본은 교집합\n", "pd.merge(left1, right1, left_on='key', right_index=True)" ] }, { "cell_type": "code", "execution_count": 276, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keyvaluegroup_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
5 c 5 NaN
\n", "
" ], "text/plain": [ " key value group_val\n", "0 a 0 3.5\n", "2 a 2 3.5\n", "3 a 3 3.5\n", "1 b 1 7.0\n", "4 b 4 7.0\n", "5 c 5 NaN" ] }, "execution_count": 276, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 외부조인\n", "pd.merge(left1, right1, left_on='key', right_index=True, how='outer')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### concat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "하나의 축을 따라 객체를 이어붙인다." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* numpy\n", " - concatenate 함수\n", "* pandas\n", " * concat 함수\n", " * Series\n", " * DataFrame\n", " * axis 인자\n", " * join_axes 인자\n", " * keys 인자\n", " * ignore_index 인자" ] }, { "cell_type": "code", "execution_count": 279, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 0, 1, 2, 3],\n", " [ 4, 5, 6, 7],\n", " [ 8, 9, 10, 11]])" ] }, "execution_count": 279, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# numpy의 concatenate 함수\n", "arr = np.arange(12).reshape((3, 4))\n", "arr" ] }, { "cell_type": "code", "execution_count": 280, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 0, 1, 2, 3, 0, 1, 2, 3],\n", " [ 4, 5, 6, 7, 4, 5, 6, 7],\n", " [ 8, 9, 10, 11, 8, 9, 10, 11]])" ] }, "execution_count": 280, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.concatenate([arr, arr], axis=1)" ] }, { "cell_type": "code", "execution_count": 284, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# pandas의 concatenate 함수\n", "\n", "# Series 객체\n", "s1 = Series([0, 1], index=['a', 'b'])\n", "s2 = Series([2, 3, 4], index=['c', 'd', 'e'])\n", "s3 = Series([5, 6], index=['f', 'g'])" ] }, { "cell_type": "code", "execution_count": 283, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 0\n", "b 1\n", "dtype: int64" ] }, "execution_count": 283, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1" ] }, { "cell_type": "code", "execution_count": 286, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "c 2\n", "d 3\n", "e 4\n", "dtype: int64" ] }, "execution_count": 286, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s2" ] }, { "cell_type": "code", "execution_count": 287, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "f 5\n", "g 6\n", "dtype: int64" ] }, "execution_count": 287, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s3" ] }, { "cell_type": "code", "execution_count": 288, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 0\n", "b 1\n", "c 2\n", "d 3\n", "e 4\n", "f 5\n", "g 6\n", "dtype: int64" ] }, "execution_count": 288, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 합치기\n", "pd.concat([s1, s2, s3])" ] }, { "cell_type": "code", "execution_count": 282, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
a 0NaNNaN
b 1NaNNaN
cNaN 2NaN
dNaN 3NaN
eNaN 4NaN
fNaNNaN 5
gNaNNaN 6
\n", "
" ], "text/plain": [ " 0 1 2\n", "a 0 NaN NaN\n", "b 1 NaN NaN\n", "c NaN 2 NaN\n", "d NaN 3 NaN\n", "e NaN 4 NaN\n", "f NaN NaN 5\n", "g NaN NaN 6" ] }, "execution_count": 282, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 축 바꿔서\n", "pd.concat([s1, s2, s3], axis=1)" ] }, { "cell_type": "code", "execution_count": 290, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 0\n", "b 5\n", "f 5\n", "g 6\n", "dtype: int64" ] }, "execution_count": 290, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s4 = pd.concat([s1*5, s3])\n", "s4" ] }, { "cell_type": "code", "execution_count": 291, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01
a 0 0
cNaNNaN
b 1 5
eNaNNaN
\n", "
" ], "text/plain": [ " 0 1\n", "a 0 0\n", "c NaN NaN\n", "b 1 5\n", "e NaN NaN" ] }, "execution_count": 291, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([s1, s4], axis=1, join_axes=[['a', 'c', 'b', 'e']])" ] }, { "cell_type": "code", "execution_count": 292, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# DataFrame도 비슷하게 작동한다\n", "\n", "df1 = DataFrame(np.arange(6).reshape(3, 2), index = ['a', 'b', 'c'],\n", " columns = ['one', 'two'])\n", "\n", "df2 = DataFrame(5 + np.arange(4).reshape(2, 2), index = ['a', 'c'],\n", " columns = ['three', 'four'])" ] }, { "cell_type": "code", "execution_count": 293, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwo
a 0 1
b 2 3
c 4 5
\n", "
" ], "text/plain": [ " one two\n", "a 0 1\n", "b 2 3\n", "c 4 5" ] }, "execution_count": 293, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": 294, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
threefour
a 5 6
c 7 8
\n", "
" ], "text/plain": [ " three four\n", "a 5 6\n", "c 7 8" ] }, "execution_count": 294, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "code", "execution_count": 295, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fouronethreetwo
aNaN 0NaN 1
bNaN 2NaN 3
cNaN 4NaN 5
a 6NaN 5NaN
c 8NaN 7NaN
\n", "
" ], "text/plain": [ " four one three two\n", "a NaN 0 NaN 1\n", "b NaN 2 NaN 3\n", "c NaN 4 NaN 5\n", "a 6 NaN 5 NaN\n", "c 8 NaN 7 NaN" ] }, "execution_count": 295, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df1, df2])" ] }, { "cell_type": "code", "execution_count": 296, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
a 0 1 5 6
b 2 3NaNNaN
c 4 5 7 8
\n", "
" ], "text/plain": [ " one two three four\n", "a 0 1 5 6\n", "b 2 3 NaN NaN\n", "c 4 5 7 8" ] }, "execution_count": 296, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df1, df2], axis=1)" ] }, { "cell_type": "code", "execution_count": 297, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
level1level2
onetwothreefour
a 0 1 5 6
b 2 3NaNNaN
c 4 5 7 8
\n", "
" ], "text/plain": [ " level1 level2 \n", " one two three four\n", "a 0 1 5 6\n", "b 2 3 NaN NaN\n", "c 4 5 7 8" ] }, "execution_count": 297, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### combine_first" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "두 객체를 포개서 한 객체에서 누락된 데이터를 다른 객체에 있는 값으로 채울 수 있게 한다." ] }, { "cell_type": "code", "execution_count": 298, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "f NaN\n", "e 2.5\n", "d NaN\n", "c 3.5\n", "b 4.5\n", "a NaN\n", "dtype: float64" ] }, "execution_count": 298, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan], index=['f', 'e', 'd', 'c', 'b', 'a'])\n", "a\n" ] }, { "cell_type": "code", "execution_count": 299, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "f 0\n", "e 1\n", "d 2\n", "c 3\n", "b 4\n", "a NaN\n", "dtype: float64" ] }, "execution_count": 299, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b = Series(np.arange(len(a), dtype=np.float64), index=['f', 'e', 'd', 'c', 'b', 'a'])\n", "b[-1] = np.nan\n", "b" ] }, { "cell_type": "code", "execution_count": 303, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 0. , 2.5, 2. , 3.5, 4.5, nan])" ] }, "execution_count": 303, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 두 자료의 색인이 겹친다. 이러면(색인이 일부겹치거나 완전히 같거나) 머지나 이어붙이기로는 불가능\n", "# 이때 조건절로 합침\n", "np.where(pd.isnull(a), b, a)" ] }, { "cell_type": "code", "execution_count": 304, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a NaN\n", "b 4.5\n", "c 3.0\n", "d 2.0\n", "e 1.0\n", "f 0.0\n", "dtype: float64" ] }, "execution_count": 304, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 위의 것과 동일하다.\n", "b[:-2].combine_first(a[2:])" ] }, { "cell_type": "code", "execution_count": 300, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "d NaN\n", "c 3.5\n", "b 4.5\n", "a NaN\n", "dtype: float64" ] }, "execution_count": 300, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[2:]" ] }, { "cell_type": "code", "execution_count": 301, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "f 0\n", "e 1\n", "d 2\n", "c 3\n", "dtype: float64" ] }, "execution_count": 301, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b[:-2]" ] }, { "cell_type": "code", "execution_count": 305, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
0 1NaN 2
1NaN 2 6
2 5NaN 10
3NaN 6 14
\n", "
" ], "text/plain": [ " a b c\n", "0 1 NaN 2\n", "1 NaN 2 6\n", "2 5 NaN 10\n", "3 NaN 6 14" ] }, "execution_count": 305, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# DataFrame은 컬럼에 대해 위와 같이 동작\n", "df1 = DataFrame({'a': [1., np.nan, 5., np.nan],\n", " 'b': [np.nan, 2., np.nan, 6.],\n", " 'c': range(2, 18, 4)})\n", "df1" ] }, { "cell_type": "code", "execution_count": 306, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
0 5NaN
1 4 3
2NaN 4
3 3 6
4 7 8
\n", "
" ], "text/plain": [ " a b\n", "0 5 NaN\n", "1 4 3\n", "2 NaN 4\n", "3 3 6\n", "4 7 8" ] }, "execution_count": 306, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = DataFrame({'a': [5., 4., np.nan, 3., 7],\n", " 'b': [np.nan, 3., 4., 6., 8.]})\n", "df2" ] }, { "cell_type": "code", "execution_count": 307, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
0 1NaN 2
1 4 2 6
2 5 4 10
3 3 6 14
4 7 8NaN
\n", "
" ], "text/plain": [ " a b c\n", "0 1 NaN 2\n", "1 4 2 6\n", "2 5 4 10\n", "3 3 6 14\n", "4 7 8 NaN" ] }, "execution_count": 307, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.combine_first(df2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 재형성과 피벗" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "표 형식의 데이터를 재배치하는 다양한 기본 연산이 존재하는데, 이런 연산을 재형성reshaping 또는 피벗이라고 한다.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 계층적 색인으로 재형성\n", "* 피버팅" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 계층적 색인으로 재형성" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* stack : 데이터의 컬럼을 로우로 피벗 또는 회전\n", "* unstack : 로우를 칼럼으로 피벗시킨다" ] }, { "cell_type": "code", "execution_count": 309, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
numberonetwothree
state
Ohio 0 1 2
Colorado 3 4 5
\n", "
" ], "text/plain": [ "number one two three\n", "state \n", "Ohio 0 1 2\n", "Colorado 3 4 5" ] }, "execution_count": 309, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = DataFrame(np.arange(6).reshape((2, 3)),\n", " index=pd.Index(['Ohio', 'Colorado'], name='state'),\n", " columns = pd.Index(['one', 'two', 'three'], name='number'))\n", "data" ] }, { "cell_type": "code", "execution_count": 310, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "state number\n", "Ohio one 0\n", " two 1\n", " three 2\n", "Colorado one 3\n", " two 4\n", " three 5\n", "dtype: int64" ] }, "execution_count": 310, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# long 포맷으로 변형\n", "result = data.stack()\n", "result" ] }, { "cell_type": "code", "execution_count": 311, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
numberonetwothree
state
Ohio 0 1 2
Colorado 3 4 5
\n", "
" ], "text/plain": [ "number one two three\n", "state \n", "Ohio 0 1 2\n", "Colorado 3 4 5" ] }, "execution_count": 311, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# wide 포맷으로 변형\n", "result.unstack()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 피버팅" ] }, { "cell_type": "code", "execution_count": 337, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ldata = DataFrame({'date': np.random.rand(9),\n", " 'item':['a','a','c', 'b', 'a', 'b', 'c', 'a', 'a'],\n", " 'value': np.arange(9)}) " ] }, { "cell_type": "code", "execution_count": 338, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dateitemvalue
0 0.091794 a 0
1 0.524387 a 1
2 0.412385 c 2
3 0.247175 b 3
4 0.731769 a 4
5 0.286297 b 5
6 0.881872 c 6
7 0.404667 a 7
8 0.383669 a 8
\n", "
" ], "text/plain": [ " date item value\n", "0 0.091794 a 0\n", "1 0.524387 a 1\n", "2 0.412385 c 2\n", "3 0.247175 b 3\n", "4 0.731769 a 4\n", "5 0.286297 b 5\n", "6 0.881872 c 6\n", "7 0.404667 a 7\n", "8 0.383669 a 8" ] }, "execution_count": 338, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ldata" ] }, { "cell_type": "code", "execution_count": 339, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
itemabc
date
0.091794 0NaNNaN
0.247175NaN 3NaN
0.286297NaN 5NaN
0.383669 8NaNNaN
0.404667 7NaNNaN
\n", "
" ], "text/plain": [ "item a b c\n", "date \n", "0.091794 0 NaN NaN\n", "0.247175 NaN 3 NaN\n", "0.286297 NaN 5 NaN\n", "0.383669 8 NaN NaN\n", "0.404667 7 NaN NaN" ] }, "execution_count": 339, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pivoted = ldata.pivot('date', 'item', 'value')\n", "pivoted.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 데이터 변형" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 중복제거" ] }, { "cell_type": "code", "execution_count": 343, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
k1k2
0 one 1
1 one 1
2 one 2
3 two 3
4 two 3
5 two 4
6 two 4
\n", "
" ], "text/plain": [ " k1 k2\n", "0 one 1\n", "1 one 1\n", "2 one 2\n", "3 two 3\n", "4 two 3\n", "5 two 4\n", "6 two 4" ] }, "execution_count": 343, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = DataFrame({'k1': ['one']*3 + ['two']*4,\n", " 'k2': [1, 1, 2, 3, 3, 4, 4]})\n", "data" ] }, { "cell_type": "code", "execution_count": 345, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 True\n", "2 False\n", "3 False\n", "4 True\n", "5 False\n", "6 True\n", "dtype: bool" ] }, "execution_count": 345, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 중복찾기\n", "data.duplicated()" ] }, { "cell_type": "code", "execution_count": 347, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
k1k2
0 one 1
2 one 2
3 two 3
5 two 4
\n", "
" ], "text/plain": [ " k1 k2\n", "0 one 1\n", "2 one 2\n", "3 two 3\n", "5 two 4" ] }, "execution_count": 347, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 중복제거\n", "data.drop_duplicates()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 함수 매핑" ] }, { "cell_type": "code", "execution_count": 348, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
foodounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 6.0
4 corned beef 7.5
5 Bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
\n", "
" ], "text/plain": [ " food ounces\n", "0 bacon 4.0\n", "1 pulled pork 3.0\n", "2 bacon 12.0\n", "3 Pastrami 6.0\n", "4 corned beef 7.5\n", "5 Bacon 8.0\n", "6 pastrami 3.0\n", "7 honey ham 5.0\n", "8 nova lox 6.0" ] }, "execution_count": 348, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = DataFrame({'food' : ['bacon', 'pulled pork', 'bacon', 'Pastrami', 'corned beef', 'Bacon', 'pastrami', 'honey ham', 'nova lox'],\n", " 'ounces' : [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})\n", "data" ] }, { "cell_type": "code", "execution_count": 355, "metadata": { "collapsed": false }, "outputs": [], "source": [ "meat_to_animal = {\n", " 'bacon' : 'pig',\n", " 'pulled pork' : 'pig',\n", " 'pastrami' : 'cow',\n", " 'corned beef' : 'cow',\n", " 'honey ham' : 'pig',\n", " 'nova lox' : 'salmon'\n", "}" ] }, { "cell_type": "code", "execution_count": 356, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
foodouncesanimal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
\n", "
" ], "text/plain": [ " food ounces animal\n", "0 bacon 4.0 pig\n", "1 pulled pork 3.0 pig\n", "2 bacon 12.0 pig\n", "3 Pastrami 6.0 cow\n", "4 corned beef 7.5 cow\n", "5 Bacon 8.0 pig\n", "6 pastrami 3.0 cow\n", "7 honey ham 5.0 pig\n", "8 nova lox 6.0 salmon" ] }, "execution_count": 356, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 육류의 이름을 소문자로 바꾸고, 해당 동물로 매핑\n", "data['animal'] = data['food'].map(str.lower).map(meat_to_animal)\n", "data" ] }, { "cell_type": "code", "execution_count": 357, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 pig\n", "1 pig\n", "2 pig\n", "3 cow\n", "4 cow\n", "5 pig\n", "6 cow\n", "7 pig\n", "8 salmon\n", "Name: food, dtype: object" ] }, "execution_count": 357, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 함수 형태로 넘길수도 있다.\n", "data['food'].map(lambda x: meat_to_animal[x.lower()])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 값 치환" ] }, { "cell_type": "code", "execution_count": 358, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 -999\n", "2 2\n", "3 -999\n", "4 -1000\n", "5 3\n", "dtype: float64" ] }, "execution_count": 358, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = Series([1., -999., 2., -999., -1000., 3.])\n", "data" ] }, { "cell_type": "code", "execution_count": 359, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 -999\n", "2 2\n", "3 -999\n", "4 -1000\n", "5 3\n", "dtype: float64" ] }, "execution_count": 359, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.replace(-999, np.nan)\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 특이값 찾아내고 치환하기" ] }, { "cell_type": "code", "execution_count": 360, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
count 1000.000000 1000.000000 1000.000000 1000.000000
mean -0.067684 0.067924 0.025598 -0.002298
std 0.998035 0.992106 1.006835 0.996794
min -3.428254 -3.548824 -3.184377 -3.745356
25% -0.774890 -0.591841 -0.641675 -0.644144
50% -0.116401 0.101143 0.002073 -0.013611
75% 0.616366 0.780282 0.680391 0.654328
max 3.366626 2.653656 3.260383 3.927528
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "count 1000.000000 1000.000000 1000.000000 1000.000000\n", "mean -0.067684 0.067924 0.025598 -0.002298\n", "std 0.998035 0.992106 1.006835 0.996794\n", "min -3.428254 -3.548824 -3.184377 -3.745356\n", "25% -0.774890 -0.591841 -0.641675 -0.644144\n", "50% -0.116401 0.101143 0.002073 -0.013611\n", "75% 0.616366 0.780282 0.680391 0.654328\n", "max 3.366626 2.653656 3.260383 3.927528" ] }, "execution_count": 360, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.seed(12345)\n", "\n", "data = DataFrame(np.random.randn(1000, 4))\n", "data.describe()" ] }, { "cell_type": "code", "execution_count": 361, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 -0.555730\n", "1 0.281746\n", "2 -1.296221\n", "3 0.886429\n", "4 -0.438570\n", "Name: 3, dtype: float64" ] }, "execution_count": 361, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 위 자료에서 한 컬럼에서 절대값이 3을 초과하는 값 찾기\n", "col = data[3]\n", "\n", "col.head()" ] }, { "cell_type": "code", "execution_count": 363, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "97 3.927528\n", "305 -3.399312\n", "400 -3.745356\n", "Name: 3, dtype: float64" ] }, "execution_count": 363, "metadata": {}, "output_type": "execute_result" } ], "source": [ "col[np.abs(col) > 3]" ] }, { "cell_type": "code", "execution_count": 364, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
5 -0.539741 0.476985 3.248944-1.021228
97 -0.774363 0.552936 0.106061 3.927528
102-0.655054-0.565230 3.176873 0.959533
305-2.315555 0.457246-0.025907-3.399312
324 0.050188 1.951312 3.260383 0.963301
400 0.146326 0.508391-0.196713-3.745356
499-0.293333-0.242459-3.056990 1.918403
523-3.428254-0.296336-0.439938-0.867165
586 0.275144 1.179227-3.184377 1.369891
808-0.362528-3.548824 1.553205-2.186301
900 3.366626-2.372214 0.851010 1.332846
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "5 -0.539741 0.476985 3.248944 -1.021228\n", "97 -0.774363 0.552936 0.106061 3.927528\n", "102 -0.655054 -0.565230 3.176873 0.959533\n", "305 -2.315555 0.457246 -0.025907 -3.399312\n", "324 0.050188 1.951312 3.260383 0.963301\n", "400 0.146326 0.508391 -0.196713 -3.745356\n", "499 -0.293333 -0.242459 -3.056990 1.918403\n", "523 -3.428254 -0.296336 -0.439938 -0.867165\n", "586 0.275144 1.179227 -3.184377 1.369891\n", "808 -0.362528 -3.548824 1.553205 -2.186301\n", "900 3.366626 -2.372214 0.851010 1.332846" ] }, "execution_count": 364, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[(np.abs(data)>3).any(1)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 데이터 그룹 연산" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "그룹 연산 = 분리-적용-결합" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": 365, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1data2key1key2
0 1.150765 1.199915 a one
1-0.997174-0.451814 a two
2 0.046486-0.155385 b one
3-0.610441-0.153514 b two
4-0.394982 0.011194 a one
\n", "
" ], "text/plain": [ " data1 data2 key1 key2\n", "0 1.150765 1.199915 a one\n", "1 -0.997174 -0.451814 a two\n", "2 0.046486 -0.155385 b one\n", "3 -0.610441 -0.153514 b two\n", "4 -0.394982 0.011194 a one" ] }, "execution_count": 365, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],\n", " 'key2' : ['one', 'two', 'one', 'two', 'one'],\n", " 'data1' : np.random.randn(5),\n", " 'data2' : np.random.randn(5)})\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 366, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 366, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 데이터를 key1으로 묶고 각 그룹에서 data1의 평균을 구하는 방법\n", "grouped = df['data1'].groupby(df['key1'])\n", "\n", "# grouped 변수는 Groupby 객체\n", "grouped" ] }, { "cell_type": "code", "execution_count": 367, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "key1\n", "a -0.080463\n", "b -0.281977\n", "Name: data1, dtype: float64" ] }, "execution_count": 367, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 평균 구하기\n", "grouped.mean()" ] }, { "cell_type": "code", "execution_count": 368, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "key1 key2\n", "a one 0.377892\n", " two -0.997174\n", "b one 0.046486\n", " two -0.610441\n", "Name: data1, dtype: float64" ] }, "execution_count": 368, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 색인 여러개 넘기기\n", "means = df['data1'].groupby([df['key1'],df['key2']]).mean()\n", "\n", "means" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 참고자료 \n", "\n", " * [1] 파이썬 라이브러리를 활용한 데이터 분석 - http://www.hanbit.co.kr/book/look.html?isbn=978-89-6848-047-8\n", " * [2] 버거지수 - http://nbviewer.ipython.org/gist/hyeshik/cf9f3d7686e07eedbfda?revision=6\n", " * [3] 데이터/수치 분석을 위한 파이썬 라이브러리 SciPy와 NumPy - http://www.hanbit.co.kr/ebook/look.html?isbn=9788968486135" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }