{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lesson 9b: Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Version 1.1. Prepared by [Makzan](https://makzan.net). Updated at 2021 March." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas stands for Python Data Analysis Library. It makes use of Series (Like a Numpy Array) and DataFrame (tabular data). In this lesson, we will learn the basic data processing, calculation and filtering by using Pandas." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- [Is student passed?](#Is-student-passed?)\n", "- [Accessing the column](#Accessing-the-column)\n", "- [Reading Excel](#Reading-Excel)\n", "- [Reading CSV](#Reading-CSV)\n", "- [Fetching CSV from Wikipedia](#Fetching-CSV-from-Wikipedia)\n", "- [Lab 4: Olymnic Games medal table](#Lab-4:-All-time-Olympic-Games-medal-table)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Is student passed?\n", "\n", "Remember the `pass_or_fail` function we defined in lesson 2? Now we can further process this example with Pandas and its DataFrame." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Apply label to data\n", "def pass_or_fail(x):\n", " if x >= 60:\n", " return True\n", " return False" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Prepare the data\n", "data = [50, 60, 20, 50, 88, 45, 63, 59]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given the data list, we can convert it into DataFrame" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Score
050
160
220
350
488
545
663
759
\n", "
" ], "text/plain": [ " Score\n", "0 50\n", "1 60\n", "2 20\n", "3 50\n", "4 88\n", "5 45\n", "6 63\n", "7 59" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Where is the pass and fail?\n", "df = pd.DataFrame(data, columns=['Score'])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can create extra column to indicate if the score is passed." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ScoreIs Passed
050False
160True
220False
350False
488True
545False
663True
759False
\n", "
" ], "text/plain": [ " Score Is Passed\n", "0 50 False\n", "1 60 True\n", "2 20 False\n", "3 50 False\n", "4 88 True\n", "5 45 False\n", "6 63 True\n", "7 59 False" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Is Passed'] = df['Score'].apply(pass_or_fail)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By using Boolean filtering, we can list only the data that meet the condition." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ScoreIs Passed
160True
663True
\n", "
" ], "text/plain": [ " Score Is Passed\n", "1 60 True\n", "6 63 True" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[ (df['Score']>=60) & (df['Score']<=80) ]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we will load the students dictionary from lesson 2." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "students = [\n", " {'name': 'Thomas', 'score': 65},\n", " {'name': 'Alan', 'score': 95},\n", " {'name': 'Jane', 'score': 85},\n", " {'name': 'Susan', 'score': 75},\n", " {'name': 'Chris', 'score': 45}\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We convert the dictionary into DataFrame. Pandas will take care of the column name automatically." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namescore
0Thomas65
1Alan95
2Jane85
3Susan75
4Chris45
\n", "
" ], "text/plain": [ " name score\n", "0 Thomas 65\n", "1 Alan 95\n", "2 Jane 85\n", "3 Susan 75\n", "4 Chris 45" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(students)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Same as the example above, we apply the `pass_or_fail` function to create a new column with Boolean." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namescoreIs Passed
0Thomas65True
1Alan95True
2Jane85True
3Susan75True
4Chris45False
\n", "
" ], "text/plain": [ " name score Is Passed\n", "0 Thomas 65 True\n", "1 Alan 95 True\n", "2 Jane 85 True\n", "3 Susan 75 True\n", "4 Chris 45 False" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Is Passed'] = df['score'].apply(pass_or_fail)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can generate a new DataFrame with only the passed records." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namescoreIs Passed
0Thomas65True
1Alan95True
2Jane85True
3Susan75True
\n", "
" ], "text/plain": [ " name score Is Passed\n", "0 Thomas 65 True\n", "1 Alan 95 True\n", "2 Jane 85 True\n", "3 Susan 75 True" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_passed = df[df[\"Is Passed\"]]\n", "df_passed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can save the processed tabular data into Excel by using `to_excel`." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "\n", "df_passed.to_excel(\"Students Passed.xlsx\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Accessing the column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can get the name of students who passed." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "0 Thomas\n", "1 Alan\n", "2 Jane\n", "3 Susan\n", "Name: name, dtype: object" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df[\"Is Passed\"]][\"name\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An example of outputing the result." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "We have total 5 students, 4 of them passed.\n" ] } ], "source": [ "count_of_passed = len(df[df[\"Is Passed\"]])\n", "result = f\"We have total {len(df)} students, {count_of_passed} of them passed.\"\n", "print(result)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "They are Thomas, Alan, Jane, Susan.\n" ] } ], "source": [ "names = ', '.join(df[df[\"Is Passed\"]][\"name\"])\n", "names = f\"They are {names}.\"\n", "\n", "print(names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading Excel" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TimeVisitors
012
120
231
343
450
568
6732
7848
8921
91010
10119
111247
121342
131415
141511
151612
161719
171855
181957
192055
202132
212224
222311
23244
\n", "
" ], "text/plain": [ " Time Visitors\n", "0 1 2\n", "1 2 0\n", "2 3 1\n", "3 4 3\n", "4 5 0\n", "5 6 8\n", "6 7 32\n", "7 8 48\n", "8 9 21\n", "9 10 10\n", "10 11 9\n", "11 12 47\n", "12 13 42\n", "13 14 15\n", "14 15 11\n", "15 16 12\n", "16 17 19\n", "17 18 55\n", "18 19 57\n", "19 20 55\n", "20 21 32\n", "21 22 24\n", "22 23 11\n", "23 24 4" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_excel('visitors.xlsx')\n", "df" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TimeVisitors
012
120
231
343
450
\n", "
" ], "text/plain": [ " Time Visitors\n", "0 1 2\n", "1 2 0\n", "2 3 1\n", "3 4 3\n", "4 5 0" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TimeVisitors
192055
202132
212224
222311
23244
\n", "
" ], "text/plain": [ " Time Visitors\n", "19 20 55\n", "20 21 32\n", "21 22 24\n", "22 23 11\n", "23 24 4" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.tail()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 2\n", "1 0\n", "2 1\n", "3 3\n", "4 0\n", "5 8\n", "6 32\n", "7 48\n", "8 21\n", "9 10\n", "10 9\n", "11 47\n", "12 42\n", "13 15\n", "14 11\n", "15 12\n", "16 19\n", "17 55\n", "18 57\n", "19 55\n", "20 32\n", "21 24\n", "22 11\n", "23 4\n", "Name: Visitors, dtype: int64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Visitors']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which hours do we have more than 30 visitors?" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TimeVisitors
6732
7848
111247
121342
171855
181957
192055
202132
\n", "
" ], "text/plain": [ " Time Visitors\n", "6 7 32\n", "7 8 48\n", "11 12 47\n", "12 13 42\n", "17 18 55\n", "18 19 57\n", "19 20 55\n", "20 21 32" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[ df['Visitors'] > 30 ]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can create new Boolean column with condition." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "df['Good Timing'] = df['Visitors']>30" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TimeVisitorsGood Timing
012False
120False
231False
343False
450False
568False
6732True
7848True
8921False
91010False
10119False
111247True
121342True
131415False
141511False
151612False
161719False
171855True
181957True
192055True
202132True
212224False
222311False
23244False
\n", "
" ], "text/plain": [ " Time Visitors Good Timing\n", "0 1 2 False\n", "1 2 0 False\n", "2 3 1 False\n", "3 4 3 False\n", "4 5 0 False\n", "5 6 8 False\n", "6 7 32 True\n", "7 8 48 True\n", "8 9 21 False\n", "9 10 10 False\n", "10 11 9 False\n", "11 12 47 True\n", "12 13 42 True\n", "13 14 15 False\n", "14 15 11 False\n", "15 16 12 False\n", "16 17 19 False\n", "17 18 55 True\n", "18 19 57 True\n", "19 20 55 True\n", "20 21 32 True\n", "21 22 24 False\n", "22 23 11 False\n", "23 24 4 False" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TimeVisitorsGood Timing
6732True
7848True
111247True
121342True
171855True
181957True
192055True
202132True
\n", "
" ], "text/plain": [ " Time Visitors Good Timing\n", "6 7 32 True\n", "7 8 48 True\n", "11 12 47 True\n", "12 13 42 True\n", "17 18 55 True\n", "18 19 57 True\n", "19 20 55 True\n", "20 21 32 True" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['Good Timing']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading CSV" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datevisitors
02018-12-1822
12018-12-170
22018-12-164
32018-12-15218
42018-12-1411
\n", "
" ], "text/plain": [ " date visitors\n", "0 2018-12-18 22\n", "1 2018-12-17 0\n", "2 2018-12-16 4\n", "3 2018-12-15 218\n", "4 2018-12-14 11" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('visitors.csv', delimiter=',', names=('date','visitors'))\n", "\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 2018-12-18\n", "1 2018-12-17\n", "2 2018-12-16\n", "3 2018-12-15\n", "4 2018-12-14\n", "Name: date, dtype: object" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['date'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fetching CSV from Wikipedia" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can convert the data table into CSV by using https://wikitable2csv.ggor.de" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, we can download the All-time Olympic Games medal table from Wikipedia:\n", "\n", "https://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can load the converted CSV directly into pandas and process the data set. Alternatively, we can clean up the CSV a little bit before loading it into pandas.\n", "\n", "For instance, we can give each column a readable name in header. We can also convert thousands with comma too." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lab 4: All-time Olympic Games medal table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have prepared a `table-1-olympics.csv` file with basic format processed." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TeamSummer GamesGoldSilverBronzeTotalWinter GamesGold.1Silver.1Bronze.1Total.1Combined ParticipatedCombined GoldCombined SilverCombined BronzeCombined Total
0Afghanistan (AFG)14002200000140022
1Algeria (ALG)1354817300001654817
2Argentina (ARG)24212528741900004321252874
3Armenia (ARM)626614700001326614
4Australasia (ANZ) [ANZ]23451200000234512
\n", "
" ], "text/plain": [ " Team Summer Games Gold Silver Bronze Total \\\n", "0 Afghanistan (AFG) 14 0 0 2 2 \n", "1 Algeria (ALG) 13 5 4 8 17 \n", "2 Argentina (ARG) 24 21 25 28 74 \n", "3 Armenia (ARM) 6 2 6 6 14 \n", "4 Australasia (ANZ) [ANZ] 2 3 4 5 12 \n", "\n", " Winter Games Gold.1 Silver.1 Bronze.1 Total.1 Combined Participated \\\n", "0 0 0 0 0 0 14 \n", "1 3 0 0 0 0 16 \n", "2 19 0 0 0 0 43 \n", "3 7 0 0 0 0 13 \n", "4 0 0 0 0 0 2 \n", "\n", " Combined Gold Combined Silver Combined Bronze Combined Total \n", "0 0 0 2 2 \n", "1 5 4 8 17 \n", "2 21 25 28 74 \n", "3 2 6 6 14 \n", "4 3 4 5 12 " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('table-1-olympics.csv')\n", "df = df[:-1] # Remove the last row of data\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(152, 16)" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By checking the `dtype`, we find that the data in \"O\" instead of \"Int\". \"O\" stands for Object because it is treated as String." ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "dtype('O')" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Total'].dtype" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can convert the dtype by using `astype`." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "df['Total'] = df['Total'].astype(int)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dtype('int64')" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Total'].dtype" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the team name is not quite readable. There are extra symbols and footnote indicators.\n", "\n", "We can clean up `Team` name and replace space into underscore:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "df[\"Team\"] = df[\"Team\"].apply(lambda x: x.split(\"(\")[0].strip().replace(\" \",\"_\"))" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TeamSummer GamesGoldSilverBronzeTotalWinter GamesGold.1Silver.1Bronze.1Total.1Combined ParticipatedCombined GoldCombined SilverCombined BronzeCombined Total
0Afghanistan14002200000140022
1Algeria1354817300001654817
2Argentina24212528741900004321252874
3Armenia626614700001326614
4Australasia23451200000234512
\n", "
" ], "text/plain": [ " Team Summer Games Gold Silver Bronze Total Winter Games Gold.1 \\\n", "0 Afghanistan 14 0 0 2 2 0 0 \n", "1 Algeria 13 5 4 8 17 3 0 \n", "2 Argentina 24 21 25 28 74 19 0 \n", "3 Armenia 6 2 6 6 14 7 0 \n", "4 Australasia 2 3 4 5 12 0 0 \n", "\n", " Silver.1 Bronze.1 Total.1 Combined Participated Combined Gold \\\n", "0 0 0 0 14 0 \n", "1 0 0 0 16 5 \n", "2 0 0 0 43 21 \n", "3 0 0 0 13 2 \n", "4 0 0 0 2 3 \n", "\n", " Combined Silver Combined Bronze Combined Total \n", "0 0 2 2 \n", "1 4 8 17 \n", "2 25 28 74 \n", "3 6 6 14 \n", "4 4 5 12 " ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set the Team name as dataframe index:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "df.set_index(\"Team\", inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make sure all numbers in the dataframe are numeric:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "df = df.astype(int)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can inspect the column names by using `df.columns`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "|Expected result|\n", "|---|\n", "|Index(['Summer Games', 'Gold', 'Silver', 'Bronze', 'Total', 'Winter Games',\n", " 'Gold.1', 'Silver.1', 'Bronze.1', 'Total.1', 'Combined Participated',\n", " 'Combined Gold', 'Combined Silver', 'Combined Bronze',\n", " 'Combined Total'],\n", " dtype='object')|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can access a particular row of data by using `iloc`" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Summer Games 22\n", "Gold 30\n", "Silver 36\n", "Bronze 63\n", "Total 129\n", "Winter Games 8\n", "Gold.1 0\n", "Silver.1 0\n", "Bronze.1 0\n", "Total.1 0\n", "Combined Participated 30\n", "Combined Gold 30\n", "Combined Silver 36\n", "Combined Bronze 63\n", "Combined Total 129\n", "Name: Brazil, dtype: int64" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[16]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also access a particular row of data by using the index label `df.loc[\"Brazil\"]`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is the Olymnpic Medal data from China?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "|Expected result|\n", "|---|\n", "|Summer Games 10\n", "Gold 224\n", "Silver 167\n", "Bronze 155\n", "Total 546\n", "Winter Games 11\n", "Gold.1 13\n", "Silver.1 28\n", "Bronze.1 21\n", "Total.1 62\n", "Combined Participated 21\n", "Combined Gold 237\n", "Combined Silver 195\n", "Combined Bronze 176\n", "Combined Total 608\n", "Name: China, dtype: int32|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many gold medal China won in Summer Olymnpics?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "|Expected result|\n", "|---|\n", "|224|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can get the rank of gold medal list by using `df.sort_values`." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Summer GamesGoldSilverBronzeTotalWinter GamesGold.1Silver.1Bronze.1Total.1Combined ParticipatedCombined GoldCombined SilverCombined BronzeCombined Total
Team
United_States271022795705252223105112883055011279077932827
Soviet_Union939531929610109785759194184733763551204
Great_Britain2826329529385123114173251274299310883
China10224167155546111328216221237195176608
France282122412637162336355312451248276316840
\n", "
" ], "text/plain": [ " Summer Games Gold Silver Bronze Total Winter Games \\\n", "Team \n", "United_States 27 1022 795 705 2522 23 \n", "Soviet_Union 9 395 319 296 1010 9 \n", "Great_Britain 28 263 295 293 851 23 \n", "China 10 224 167 155 546 11 \n", "France 28 212 241 263 716 23 \n", "\n", " Gold.1 Silver.1 Bronze.1 Total.1 Combined Participated \\\n", "Team \n", "United_States 105 112 88 305 50 \n", "Soviet_Union 78 57 59 194 18 \n", "Great_Britain 11 4 17 32 51 \n", "China 13 28 21 62 21 \n", "France 36 35 53 124 51 \n", "\n", " Combined Gold Combined Silver Combined Bronze Combined Total \n", "Team \n", "United_States 1127 907 793 2827 \n", "Soviet_Union 473 376 355 1204 \n", "Great_Britain 274 299 310 883 \n", "China 237 195 176 608 \n", "France 248 276 316 840 " ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_values(by=\"Gold\", ascending=False)[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many teams earn gold medals in both summer Olympics and winder Olympics? " ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "38" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df[(df[\"Gold\"] > 0) & (df[\"Gold.1\"] > 0)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which teams are they?" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
GoldGold.1
Team
Australia1475
Austria1864
Belarus128
Belgium401
Bulgaria511
Canada6473
China22413
Croatia114
Czech_Republic159
Czechoslovakia492
Estonia94
Finland10143
France21236
Germany19192
United_Team_of_Germany288
East_Germany15339
West_Germany5611
Great_Britain26311
Hungary1751
Italy20640
Japan14214
Kazakhstan151
South_Korea9031
Latvia31
Netherlands8545
Norway56132
Poland687
Russia14847
Soviet_Union39578
Slovakia93
Slovenia52
Spain451
Sweden14557
Switzerland5056
Ukraine353
United_States1022105
Uzbekistan81
Unified_Team459
\n", "
" ], "text/plain": [ " Gold Gold.1\n", "Team \n", "Australia 147 5\n", "Austria 18 64\n", "Belarus 12 8\n", "Belgium 40 1\n", "Bulgaria 51 1\n", "Canada 64 73\n", "China 224 13\n", "Croatia 11 4\n", "Czech_Republic 15 9\n", "Czechoslovakia 49 2\n", "Estonia 9 4\n", "Finland 101 43\n", "France 212 36\n", "Germany 191 92\n", "United_Team_of_Germany 28 8\n", "East_Germany 153 39\n", "West_Germany 56 11\n", "Great_Britain 263 11\n", "Hungary 175 1\n", "Italy 206 40\n", "Japan 142 14\n", "Kazakhstan 15 1\n", "South_Korea 90 31\n", "Latvia 3 1\n", "Netherlands 85 45\n", "Norway 56 132\n", "Poland 68 7\n", "Russia 148 47\n", "Soviet_Union 395 78\n", "Slovakia 9 3\n", "Slovenia 5 2\n", "Spain 45 1\n", "Sweden 145 57\n", "Switzerland 50 56\n", "Ukraine 35 3\n", "United_States 1022 105\n", "Uzbekistan 8 1\n", "Unified_Team 45 9" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[(df[\"Gold\"] > 0) & (df[\"Gold.1\"] > 0)][['Gold','Gold.1']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How about the team that wins gold medals in winter Olympics but not summer Olympics?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "|Expected result|\n", "|---|\n", "|['Liechtenstein', 'Olympic_Athletes_from_Russia']|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example: Reading DSEC Visitors.xlsx" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
時期按證件簽發地統計之旅客按證件簽發地統計之留宿旅客按證件簽發地統計之不過夜旅客
12011年1月1.12.20.2
22011年2月120.2
32011年3月12.10.2
42011年4月120.2
52011年5月1.12.10.2
\n", "
" ], "text/plain": [ " 時期 按證件簽發地統計之旅客 按證件簽發地統計之留宿旅客 按證件簽發地統計之不過夜旅客\n", "1 2011年1月 1.1 2.2 0.2\n", "2 2011年2月 1 2 0.2\n", "3 2011年3月 1 2.1 0.2\n", "4 2011年4月 1 2 0.2\n", "5 2011年5月 1.1 2.1 0.2" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_excel(\"dsec_visitors.xlsx\", skiprows=4, skipfooter=4)\n", "df.columns\n", "\n", "df = df.rename(columns={\n", " \"Unnamed: 0\": \"時期\",\n", " \"按證件簽發地統計之旅客 b\": \"按證件簽發地統計之旅客\",\n", " \"按證件簽發地統計之留宿旅客 b\": \"按證件簽發地統計之留宿旅客\",\n", " \"按證件簽發地統計之不過夜旅客 b\": \"按證件簽發地統計之不過夜旅客\"\n", "})\n", "df = df.drop(columns=[\"Unnamed: 2\", \"Unnamed: 4\", \"Unnamed: 6\"])\n", "df = df[1:]\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def year_from_dsec_date_period(period):\n", " \"\"\"Convert 2011年5月 into 2011\"\"\"\n", " return int(period.split(\"年\")[0])\n", "def month_from_dsec_date_period(period):\n", " \"\"\"Convert 2011年5月 into 2011\"\"\"\n", " return int(period.split(\"年\")[1].replace(\"月\",\"\"))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
時期按證件簽發地統計之旅客按證件簽發地統計之留宿旅客按證件簽發地統計之不過夜旅客YearMonth
12011年1月1.12.20.220111
22011年2月120.220112
32011年3月12.10.220113
42011年4月120.220114
52011年5月1.12.10.220115
.....................
1152020年7月1.34.70.120207
1162020年8月0.93.30.120208
1172020年9月12.70.120209
1182020年10月1.42.70.1202010
1192020年11月1.73.10.1202011
\n", "

119 rows × 6 columns

\n", "
" ], "text/plain": [ " 時期 按證件簽發地統計之旅客 按證件簽發地統計之留宿旅客 按證件簽發地統計之不過夜旅客 Year Month\n", "1 2011年1月 1.1 2.2 0.2 2011 1\n", "2 2011年2月 1 2 0.2 2011 2\n", "3 2011年3月 1 2.1 0.2 2011 3\n", "4 2011年4月 1 2 0.2 2011 4\n", "5 2011年5月 1.1 2.1 0.2 2011 5\n", ".. ... ... ... ... ... ...\n", "115 2020年7月 1.3 4.7 0.1 2020 7\n", "116 2020年8月 0.9 3.3 0.1 2020 8\n", "117 2020年9月 1 2.7 0.1 2020 9\n", "118 2020年10月 1.4 2.7 0.1 2020 10\n", "119 2020年11月 1.7 3.1 0.1 2020 11\n", "\n", "[119 rows x 6 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"Year\"] = df[\"時期\"].apply(year_from_dsec_date_period)\n", "df[\"Month\"] = df[\"時期\"].apply(month_from_dsec_date_period)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Summary\n", "\n", "In this lesson, we learn to process tabular data by using Pandas." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 2 }