{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### 따라 하며 배우는 데이터 과학 - 파이썬 편 2 - 데이터 취득과 가공\n", "\n", "본 장은 [\"따라 하며 배우는 데이터 과학\"](https://dataninja.me/ipds-kr/) 3장의 \n", "\"데이터 취득과 데이터 가공: SQL과 dplyr\" 내용의 파이썬 버전입니다." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Data Sources:\n", "- `data/gapminder.tsv` : https://raw.githubusercontent.com/jennybc/gapminder/master/data-raw/08_gap-every-five-years.tsv" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline\n", "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. gapminder 자료 읽어들이기" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# gapminder 자료를 다운로드하고 판다스 데이터프레임으로 읽어들이자\n", "gapminder = pd.read_csv(\"data/gapminder.tsv\", sep=\"\\t\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrycontinentyearlifeExppopgdpPercap
0AfghanistanAsia195228.8018425333779.445314
1AfghanistanAsia195730.3329240934820.853030
2AfghanistanAsia196231.99710267083853.100710
3AfghanistanAsia196734.02011537966836.197138
4AfghanistanAsia197236.08813079460739.981106
\n", "
" ], "text/plain": [ " country continent year lifeExp pop gdpPercap\n", "0 Afghanistan Asia 1952 28.801 8425333 779.445314\n", "1 Afghanistan Asia 1957 30.332 9240934 820.853030\n", "2 Afghanistan Asia 1962 31.997 10267083 853.100710\n", "3 Afghanistan Asia 1967 34.020 11537966 836.197138\n", "4 Afghanistan Asia 1972 36.088 13079460 739.981106" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gapminder.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrycontinentyearlifeExppopgdpPercap
1699ZimbabweAfrica198762.3519216418706.157306
1700ZimbabweAfrica199260.37710704340693.420786
1701ZimbabweAfrica199746.80911404948792.449960
1702ZimbabweAfrica200239.98911926563672.038623
1703ZimbabweAfrica200743.48712311143469.709298
\n", "
" ], "text/plain": [ " country continent year lifeExp pop gdpPercap\n", "1699 Zimbabwe Africa 1987 62.351 9216418 706.157306\n", "1700 Zimbabwe Africa 1992 60.377 10704340 693.420786\n", "1701 Zimbabwe Africa 1997 46.809 11404948 792.449960\n", "1702 Zimbabwe Africa 2002 39.989 11926563 672.038623\n", "1703 Zimbabwe Africa 2007 43.487 12311143 469.709298" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gapminder.tail()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 1704 entries, 0 to 1703\n", "Data columns (total 6 columns):\n", "country 1704 non-null object\n", "continent 1704 non-null object\n", "year 1704 non-null int64\n", "lifeExp 1704 non-null float64\n", "pop 1704 non-null int64\n", "gdpPercap 1704 non-null float64\n", "dtypes: float64(2), int64(2), object(2)\n", "memory usage: 79.9+ KB\n" ] } ], "source": [ "gapminder.info()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearlifeExppopgdpPercap
count1704.000001704.0000001.704000e+031704.000000
mean1979.5000059.4744392.960121e+077215.327081
std17.2653312.9171071.061579e+089857.454543
min1952.0000023.5990006.001100e+04241.165877
25%1965.7500048.1980002.793664e+061202.060309
50%1979.5000060.7125007.023596e+063531.846989
75%1993.2500070.8455001.958522e+079325.462346
max2007.0000082.6030001.318683e+09113523.132900
\n", "
" ], "text/plain": [ " year lifeExp pop gdpPercap\n", "count 1704.00000 1704.000000 1.704000e+03 1704.000000\n", "mean 1979.50000 59.474439 2.960121e+07 7215.327081\n", "std 17.26533 12.917107 1.061579e+08 9857.454543\n", "min 1952.00000 23.599000 6.001100e+04 241.165877\n", "25% 1965.75000 48.198000 2.793664e+06 1202.060309\n", "50% 1979.50000 60.712500 7.023596e+06 3531.846989\n", "75% 1993.25000 70.845500 1.958522e+07 9325.462346\n", "max 2007.00000 82.603000 1.318683e+09 113523.132900" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gapminder.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 데이터를 처리하는 핵심 동사\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. 행을 선택하기" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrycontinentyearlifeExppopgdpPercap
851Korea, Rep.Asia200778.6234904479023348.13973
\n", "
" ], "text/plain": [ " country continent year lifeExp pop gdpPercap\n", "851 Korea, Rep. Asia 2007 78.623 49044790 23348.13973" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# gapminder 데이터에서 한국 데이터, 2007년 데이터, 한국 2007년 데이터를 추출하는 명령은 다음과 같다.\n", "# R dplyr 에서는 filter(gapminder, country=='Korea, Rep.' & year==2007)\n", "gapminder.query(\"country=='Korea, Rep.' & year==2007\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. 행(관측치)를 정렬하기" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrycontinentyearlifeExppopgdpPercap
0AfghanistanAsia195228.8018425333779.445314
12AlbaniaEurope195255.23012826971601.056136
24AlgeriaAfrica195243.07792795252449.008185
36AngolaAfrica195230.01542320953520.610273
48ArgentinaAmericas195262.485178769565911.315053
\n", "
" ], "text/plain": [ " country continent year lifeExp pop gdpPercap\n", "0 Afghanistan Asia 1952 28.801 8425333 779.445314\n", "12 Albania Europe 1952 55.230 1282697 1601.056136\n", "24 Algeria Africa 1952 43.077 9279525 2449.008185\n", "36 Angola Africa 1952 30.015 4232095 3520.610273\n", "48 Argentina Americas 1952 62.485 17876956 5911.315053" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# gapminder 데이터를 year, country 변수순으로 정렬하려면,\n", "# R dplyr 에서는 gapminder %>% arrange(year, country)\n", "gapminder.sort_values(by=[\"year\", \"country\"]).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. 열(변수)를 선택하기" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
popgdpPercap
08425333779.445314
19240934820.853030
210267083853.100710
311537966836.197138
413079460739.981106
\n", "
" ], "text/plain": [ " pop gdpPercap\n", "0 8425333 779.445314\n", "1 9240934 820.853030\n", "2 10267083 853.100710\n", "3 11537966 836.197138\n", "4 13079460 739.981106" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# gapminder 데이터에서 pop, gdpPercap 변수만 선택.\n", "# R dplyr 에서는 gapminder %>% select(pop, gdpPercap)\n", "gapminder[[\"pop\", \"gdpPercap\"]].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. 변수 변환하기" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrycontinentyearlifeExppopgdpPercaptotal_gdple_gdp_ratiolgrk
0AfghanistanAsia195228.8018425333779.4453146.567086e+090.0369513.695064
1AfghanistanAsia195730.3329240934820.8530307.585449e+090.0369523.695180
2AfghanistanAsia196231.99710267083853.1007108.758856e+090.0375073.750671
3AfghanistanAsia196734.02011537966836.1971389.648014e+090.0406844.068419
4AfghanistanAsia197236.08813079460739.9811069.678553e+090.0487694.876881
\n", "
" ], "text/plain": [ " country continent year lifeExp pop gdpPercap total_gdp \\\n", "0 Afghanistan Asia 1952 28.801 8425333 779.445314 6.567086e+09 \n", "1 Afghanistan Asia 1957 30.332 9240934 820.853030 7.585449e+09 \n", "2 Afghanistan Asia 1962 31.997 10267083 853.100710 8.758856e+09 \n", "3 Afghanistan Asia 1967 34.020 11537966 836.197138 9.648014e+09 \n", "4 Afghanistan Asia 1972 36.088 13079460 739.981106 9.678553e+09 \n", "\n", " le_gdp_ratio lgrk \n", "0 0.036951 3.695064 \n", "1 0.036952 3.695180 \n", "2 0.037507 3.750671 \n", "3 0.040684 4.068419 \n", "4 0.048769 4.876881 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# gapminder 데이터에서 기존의 변수들을 변환한 결과를 기존 변수나 새 변수에 할당한다. \n", "# R dplyr 에서는 \n", "# gapminder %>%\n", "# mutate(total_gdp = pop * gdpPercap,\n", "# le_gdp_ratio = lifeExp / gdpPercap, lgrk = le_gdp_ratio * 100)\n", "# 1. 파이썬에서는 각 변수 할당에 새로운 assign() 함수를 사용해야 한다.\n", "# 2. x.pop 은 내부의 pop() 함수와 충돌을 일으키므로 x['pop']으로 표현했다.\n", "gapminder.\\\n", " assign(total_gdp = lambda x: (x['pop'] * x['gdpPercap'])).\\\n", " assign(le_gdp_ratio = lambda x: (x['lifeExp'] / x['gdpPercap'])).\\\n", " assign(lgrk = lambda x: x['le_gdp_ratio'] * 100).\\\n", " head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. 요약 통계량 계산하기" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearlifeExppopgdpPercap
mean1979.559.4744392.960121e+077215.327081
median1979.560.7125007.023596e+063531.846989
\n", "
" ], "text/plain": [ " year lifeExp pop gdpPercap\n", "mean 1979.5 59.474439 2.960121e+07 7215.327081\n", "median 1979.5 60.712500 7.023596e+06 3531.846989" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# gapminder 데이터에서 기존의 변수들을 변환한 결과를 기존 변수나 새 변수에 할당한다. \n", "# R dplyr 에서는 \n", "# gapminder %>% summarize(n_obs = n( ),\n", "# n_countries = n_distinct(country),\n", "# n_years = n_distinct(year),\n", "# med_gdpc = median(gdpPercap),\n", "# max_gdppc = max(gdpPercap))\n", "gapminder.aggregate(['mean', 'median'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. 랜덤 샘플을 위한 sample()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrycontinentyearlifeExppopgdpPercap
1522TanzaniaAfrica200249.65134593779899.074211
1553Trinidad and TobagoAmericas197768.30010390097899.554209
238CameroonAfrica200249.856159299881934.011449
1085NetherlandsEurope197775.2401385298921209.059200
634Guinea-BissauAfrica200245.5041332459575.704718
1569TunisiaAfrica199771.97392316694876.798614
648HondurasAmericas195241.91215174532194.926204
1072NepalAsia197243.97112412593674.788130
1617United StatesAmericas199776.81027291176035767.433030
953MaliAfrica197741.7146491649686.395269
\n", "
" ], "text/plain": [ " country continent year lifeExp pop gdpPercap\n", "1522 Tanzania Africa 2002 49.651 34593779 899.074211\n", "1553 Trinidad and Tobago Americas 1977 68.300 1039009 7899.554209\n", "238 Cameroon Africa 2002 49.856 15929988 1934.011449\n", "1085 Netherlands Europe 1977 75.240 13852989 21209.059200\n", "634 Guinea-Bissau Africa 2002 45.504 1332459 575.704718\n", "1569 Tunisia Africa 1997 71.973 9231669 4876.798614\n", "648 Honduras Americas 1952 41.912 1517453 2194.926204\n", "1072 Nepal Asia 1972 43.971 12412593 674.788130\n", "1617 United States Americas 1997 76.810 272911760 35767.433030\n", "953 Mali Africa 1977 41.714 6491649 686.395269" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.seed(12345)\n", "gapminder.sample(n=10)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(17, 6)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gapminder.sample(frac=.01).shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. 고유한 행을 찾아내는 distinct( )" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',\n", " 'Australia', 'Austria', 'Bahrain', 'Bangladesh', 'Belgium', 'Benin',\n", " 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',\n", " 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon',\n", " 'Canada', 'Central African Republic', 'Chad', 'Chile', 'China',\n", " 'Colombia', 'Comoros', 'Congo, Dem. Rep.', 'Congo, Rep.',\n", " 'Costa Rica', \"Cote d'Ivoire\", 'Croatia', 'Cuba', 'Czech Republic',\n", " 'Denmark', 'Djibouti', 'Dominican Republic', 'Ecuador', 'Egypt',\n", " 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Ethiopia',\n", " 'Finland', 'France', 'Gabon', 'Gambia', 'Germany', 'Ghana',\n", " 'Greece', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Haiti',\n", " 'Honduras', 'Hong Kong, China', 'Hungary', 'Iceland', 'India',\n", " 'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy',\n", " 'Jamaica', 'Japan', 'Jordan', 'Kenya', 'Korea, Dem. Rep.',\n", " 'Korea, Rep.', 'Kuwait', 'Lebanon', 'Lesotho', 'Liberia', 'Libya',\n", " 'Madagascar', 'Malawi', 'Malaysia', 'Mali', 'Mauritania',\n", " 'Mauritius', 'Mexico', 'Mongolia', 'Montenegro', 'Morocco',\n", " 'Mozambique', 'Myanmar', 'Namibia', 'Nepal', 'Netherlands',\n", " 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Norway', 'Oman',\n", " 'Pakistan', 'Panama', 'Paraguay', 'Peru', 'Philippines', 'Poland',\n", " 'Portugal', 'Puerto Rico', 'Reunion', 'Romania', 'Rwanda',\n", " 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia',\n", " 'Sierra Leone', 'Singapore', 'Slovak Republic', 'Slovenia',\n", " 'Somalia', 'South Africa', 'Spain', 'Sri Lanka', 'Sudan',\n", " 'Swaziland', 'Sweden', 'Switzerland', 'Syria', 'Taiwan', 'Tanzania',\n", " 'Thailand', 'Togo', 'Trinidad and Tobago', 'Tunisia', 'Turkey',\n", " 'Uganda', 'United Kingdom', 'United States', 'Uruguay', 'Venezuela',\n", " 'Vietnam', 'West Bank and Gaza', 'Yemen, Rep.', 'Zambia', 'Zimbabwe'], dtype=object),\n", " array([1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002,\n", " 2007]))" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# R dplyr 는\n", "# gapminder %>% select(country) %>% distinct()\n", "# gapminder %>% select(year) %>% distinct()\n", "\n", "gapminder.country.unique(), gapminder.year.unique()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrycontinentyearlifeExppopgdpPercap
0AfghanistanAsia195228.8018425333779.445314
1AfghanistanAsia195730.3329240934820.853030
2AfghanistanAsia196231.99710267083853.100710
3AfghanistanAsia196734.02011537966836.197138
4AfghanistanAsia197236.08813079460739.981106
\n", "
" ], "text/plain": [ " country continent year lifeExp pop gdpPercap\n", "0 Afghanistan Asia 1952 28.801 8425333 779.445314\n", "1 Afghanistan Asia 1957 30.332 9240934 820.853030\n", "2 Afghanistan Asia 1962 31.997 10267083 853.100710\n", "3 Afghanistan Asia 1967 34.020 11537966 836.197138\n", "4 Afghanistan Asia 1972 36.088 13079460 739.981106" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gapminder.drop_duplicates(['country', 'year']).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# group_by() 를 이용한 그룹 연산" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
lifeExp
continent
Africa52.9265
Americas72.8990
Asia72.3960
Europe78.6085
Oceania80.7195
\n", "
" ], "text/plain": [ " lifeExp\n", "continent \n", "Africa 52.9265\n", "Americas 72.8990\n", "Asia 72.3960\n", "Europe 78.6085\n", "Oceania 80.7195" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# R dplyr 는\n", "# gapminder %>%\n", "# filter(year == 2007) %>% \n", "# group_by(continent) %>% \n", "# summarize(median(lifeExp))\n", "\n", "gapminder.\\\n", " query('year == 2007').\\\n", " groupby('continent').\\\n", " agg({'lifeExp':'median'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 조인 연산자; inner, left, right, full(outer) join\n", "R dplyr 예는 다음과 같다.\n", "```\n", "(df1 <- data_frame(x = c(1, 2), y = 2:1))\n", "(df2 <- data_frame(x = c(1, 3), a = 10, b = \"a\"))\n", "df1 %>% inner_join(df2)\n", "df1 %>% left_join(df2)\n", "df1 %>% right_join(df2)\n", "df1 %>% full_join(df2)\n", "```\n", "파이썬 판다스에서는 `DataFrame.merge` 함수로 처리하면 된다." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xy
002
111
\n", "
" ], "text/plain": [ " x y\n", "0 0 2\n", "1 1 1" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.DataFrame(data={'x':range(2), 'y':range(2, 0, -1)})\n", "df1" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abx
010a1
110a3
\n", "
" ], "text/plain": [ " a b x\n", "0 10 a 1\n", "1 10 a 3" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = pd.DataFrame(data={'x':[1,3], 'a':10, 'b':\"a\"})\n", "df2" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xyab
01110a
\n", "
" ], "text/plain": [ " x y a b\n", "0 1 1 10 a" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.merge(df2, how=\"inner\")" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xyab
002NaNNaN
11110.0a
\n", "
" ], "text/plain": [ " x y a b\n", "0 0 2 NaN NaN\n", "1 1 1 10.0 a" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.merge(df2, how=\"left\")" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xyab
01110a
\n", "
" ], "text/plain": [ " x y a b\n", "0 1 1 10 a" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.merge(df2, how=\"inner\")" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xyab
011.010a
13NaN10a
\n", "
" ], "text/plain": [ " x y a b\n", "0 1 1.0 10 a\n", "1 3 NaN 10 a" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.merge(df2, how=\"right\")" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xyab
002.0NaNNaN
111.010.0a
23NaN10.0a
\n", "
" ], "text/plain": [ " x y a b\n", "0 0 2.0 NaN NaN\n", "1 1 1.0 10.0 a\n", "2 3 NaN 10.0 a" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.merge(df2, how=\"outer\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.14" } }, "nbformat": 4, "nbformat_minor": 2 }