{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Module 3\n", "\n", "## Video 14: Working with DataFrames\n", "**Python for the Energy Industry**\n", "\n", "There are a number of built-in functions for exploring the data in a DataFrame. We'll be working with a larger example DataFrame:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CountryRegionPopulationOil ProductionProven Reserves
0AlgeriaNorth Africa422284081348361.01.220000e+10
1AngolaSouthern Africa308097871769615.08.423000e+09
2Equatorial GuineaCentral Africa1308975NaNNaN
3GabonCentral Africa2119275210820.02.000000e+09
4IranMiddle East818001883990956.01.575300e+11
5IraqMiddle East384336004451516.01.430690e+11
6KuwaitMiddle East41373122923825.01.015000e+11
7LibyaNorth Africa6678559384686.04.836300e+10
8NigeriaWest Africa1958746851999885.03.707000e+10
9Republic of the CongoCentral Africa5125821260000.01.600000e+09
10Saudi ArabiaMiddle East3370275610460710.02.665780e+11
11UAEMiddle East96309593106077.09.780000e+10
12VenezuelaSouth America288871182276967.02.999530e+11
\n", "
" ], "text/plain": [ " Country Region Population Oil Production \\\n", "0 Algeria North Africa 42228408 1348361.0 \n", "1 Angola Southern Africa 30809787 1769615.0 \n", "2 Equatorial Guinea Central Africa 1308975 NaN \n", "3 Gabon Central Africa 2119275 210820.0 \n", "4 Iran Middle East 81800188 3990956.0 \n", "5 Iraq Middle East 38433600 4451516.0 \n", "6 Kuwait Middle East 4137312 2923825.0 \n", "7 Libya North Africa 6678559 384686.0 \n", "8 Nigeria West Africa 195874685 1999885.0 \n", "9 Republic of the Congo Central Africa 5125821 260000.0 \n", "10 Saudi Arabia Middle East 33702756 10460710.0 \n", "11 UAE Middle East 9630959 3106077.0 \n", "12 Venezuela South America 28887118 2276967.0 \n", "\n", " Proven Reserves \n", "0 1.220000e+10 \n", "1 8.423000e+09 \n", "2 NaN \n", "3 2.000000e+09 \n", "4 1.575300e+11 \n", "5 1.430690e+11 \n", "6 1.015000e+11 \n", "7 4.836300e+10 \n", "8 3.707000e+10 \n", "9 1.600000e+09 \n", "10 2.665780e+11 \n", "11 9.780000e+10 \n", "12 2.999530e+11 " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "OPEC_df = pd.DataFrame({\n", " 'Country': ['Algeria','Angola','Equatorial Guinea','Gabon','Iran','Iraq','Kuwait','Libya','Nigeria','Republic of the Congo','Saudi Arabia','UAE','Venezuela'],\n", " 'Region': ['North Africa','Southern Africa','Central Africa','Central Africa','Middle East','Middle East','Middle East','North Africa','West Africa','Central Africa','Middle East','Middle East','South America'],\n", " 'Population': [42228408,30809787,1308975,2119275,81800188,38433600,4137312,6678559,195874685,5125821,33702756,9630959,28887118],\n", " 'Oil Production': [1348361,1769615,np.nan,210820,3990956,4451516,2923825,384686,1999885,260000,10460710,3106077,2276967],\n", " 'Proven Reserves': [12.2e9,8.423e9,np.nan,2e9,157.53e9,143.069e9,101.5e9,48.363e9,37.07e9,1.6e9,266.578e9,97.8e9,299.953e9]\n", "})\n", "\n", "OPEC_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For large DataFrames, rather than printing out large amounts of data, we can take a peek at some data with the 'head' and 'tail' functions:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CountryRegionPopulationOil ProductionProven Reserves
0AlgeriaNorth Africa422284081348361.01.220000e+10
1AngolaSouthern Africa308097871769615.08.423000e+09
2Equatorial GuineaCentral Africa1308975NaNNaN
\n", "
" ], "text/plain": [ " Country Region Population Oil Production \\\n", "0 Algeria North Africa 42228408 1348361.0 \n", "1 Angola Southern Africa 30809787 1769615.0 \n", "2 Equatorial Guinea Central Africa 1308975 NaN \n", "\n", " Proven Reserves \n", "0 1.220000e+10 \n", "1 8.423000e+09 \n", "2 NaN " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "OPEC_df.head(3)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CountryRegionPopulationOil ProductionProven Reserves
10Saudi ArabiaMiddle East3370275610460710.02.665780e+11
11UAEMiddle East96309593106077.09.780000e+10
12VenezuelaSouth America288871182276967.02.999530e+11
\n", "
" ], "text/plain": [ " Country Region Population Oil Production Proven Reserves\n", "10 Saudi Arabia Middle East 33702756 10460710.0 2.665780e+11\n", "11 UAE Middle East 9630959 3106077.0 9.780000e+10\n", "12 Venezuela South America 28887118 2276967.0 2.999530e+11" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "OPEC_df.tail(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also take a look at what columns we have, and what type of data they store, with the 'info' function." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 13 entries, 0 to 12\n", "Data columns (total 5 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Country 13 non-null object \n", " 1 Region 13 non-null object \n", " 2 Population 13 non-null int64 \n", " 3 Oil Production 12 non-null float64\n", " 4 Proven Reserves 12 non-null float64\n", "dtypes: float64(2), int64(1), object(2)\n", "memory usage: 648.0+ bytes\n" ] } ], "source": [ "OPEC_df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also get a statistical description of our data. The 'describe' function will return basic stats on all columns with numeric data. There are also functions for each individual statistic." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PopulationOil ProductionProven Reserves
count1.300000e+011.200000e+011.200000e+01
mean3.697980e+072.765285e+069.800717e+10
std5.295242e+072.796082e+061.021860e+11
min1.308975e+062.108200e+051.600000e+09
25%5.125821e+061.107442e+061.125575e+10
50%2.888712e+072.138426e+067.308150e+10
75%3.843360e+073.327297e+061.466842e+11
max1.958747e+081.046071e+072.999530e+11
\n", "
" ], "text/plain": [ " Population Oil Production Proven Reserves\n", "count 1.300000e+01 1.200000e+01 1.200000e+01\n", "mean 3.697980e+07 2.765285e+06 9.800717e+10\n", "std 5.295242e+07 2.796082e+06 1.021860e+11\n", "min 1.308975e+06 2.108200e+05 1.600000e+09\n", "25% 5.125821e+06 1.107442e+06 1.125575e+10\n", "50% 2.888712e+07 2.138426e+06 7.308150e+10\n", "75% 3.843360e+07 3.327297e+06 1.466842e+11\n", "max 1.958747e+08 1.046071e+07 2.999530e+11" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "OPEC_df.describe()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2765284.8333333335" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# get the mean of a single column\n", "OPEC_df['Oil Production'].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading & Writing to File\n", "\n", "Pandas can also be used to write a DataFrame into an excel/csv format, or read in a DataFrame from a file. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# write OPEC_df to an excel spreadsheet\n", "OPEC_df.to_csv('OPEC_df.csv', index=False)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CountryRegionPopulationOil ProductionProven Reserves
0AlgeriaNorth Africa422284081348361.01.220000e+10
1AngolaSouthern Africa308097871769615.08.423000e+09
2Equatorial GuineaCentral Africa1308975NaNNaN
3GabonCentral Africa2119275210820.02.000000e+09
4IranMiddle East818001883990956.01.575300e+11
5IraqMiddle East384336004451516.01.430690e+11
6KuwaitMiddle East41373122923825.01.015000e+11
7LibyaNorth Africa6678559384686.04.836300e+10
8NigeriaWest Africa1958746851999885.03.707000e+10
9Republic of the CongoCentral Africa5125821260000.01.600000e+09
10Saudi ArabiaMiddle East3370275610460710.02.665780e+11
11UAEMiddle East96309593106077.09.780000e+10
12VenezuelaSouth America288871182276967.02.999530e+11
\n", "
" ], "text/plain": [ " Country Region Population Oil Production \\\n", "0 Algeria North Africa 42228408 1348361.0 \n", "1 Angola Southern Africa 30809787 1769615.0 \n", "2 Equatorial Guinea Central Africa 1308975 NaN \n", "3 Gabon Central Africa 2119275 210820.0 \n", "4 Iran Middle East 81800188 3990956.0 \n", "5 Iraq Middle East 38433600 4451516.0 \n", "6 Kuwait Middle East 4137312 2923825.0 \n", "7 Libya North Africa 6678559 384686.0 \n", "8 Nigeria West Africa 195874685 1999885.0 \n", "9 Republic of the Congo Central Africa 5125821 260000.0 \n", "10 Saudi Arabia Middle East 33702756 10460710.0 \n", "11 UAE Middle East 9630959 3106077.0 \n", "12 Venezuela South America 28887118 2276967.0 \n", "\n", " Proven Reserves \n", "0 1.220000e+10 \n", "1 8.423000e+09 \n", "2 NaN \n", "3 2.000000e+09 \n", "4 1.575300e+11 \n", "5 1.430690e+11 \n", "6 1.015000e+11 \n", "7 4.836300e+10 \n", "8 3.707000e+10 \n", "9 1.600000e+09 \n", "10 2.665780e+11 \n", "11 9.780000e+10 \n", "12 2.999530e+11 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a new DataFrame from OPEC_df.csv\n", "OPEC_df_copy = pd.read_csv('OPEC_df.csv')\n", "OPEC_df_copy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Obtain a spreadsheet with some data you would like to explore. You can use the example 'countries.csv' on the course page.\n", "\n", "Make sure the spreadsheet is in the sample folder as this notebook. Read the file in as a DataFrame, and use the above functions to explore the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.9" } }, "nbformat": 4, "nbformat_minor": 4 }