{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Module 3\n", "\n", "## Video 13: Pandas\n", "**Python for the Energy Industry**\n", "\n", "Pandas is a Python module designed for working with tabular data. The core data structure of pandas is the DataFrame. DataFrames share a lot in common with NumPy arrays, with two main differences:\n", "- They can also store non-numeric data\n", "- The columns in a DataFrame are generally given text labels\n", "\n", "A DataFrame can be created by pandas from a dictionary, where each key is a column label and each value is the list of that column's values, one per 'entry' (row)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Country Population HDI\n", "0 China 1439323776 0.758\n", "1 US 331002651 0.920\n", "2 Russia 145934462 0.824\n", "3 UK 67886011 0.920\n" ] } ], "source": [ "import pandas as pd\n", "\n", "# Each key becomes a column label; each list holds that column's values\n", "country_df = pd.DataFrame({\n", " 'Country': ['China', 'US', 'Russia', 'UK'],\n", " 'Population': [1439323776, 331002651, 145934462, 67886011],\n", " 'HDI': [0.758, 0.920, 0.824, 0.920]\n", "})\n", "\n", "print(country_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, there are 4 entries representing countries, each with corresponding population and HDI values. 
A particular column can be accessed with its label in square brackets, in the same way as a value is looked up in a dictionary:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 1439323776\n", "1 331002651\n", "2 145934462\n", "3 67886011\n", "Name: Population, dtype: int64\n" ] } ], "source": [ "print(country_df['Population'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Multiple columns can be accessed at once by passing a list of column labels:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Country Population\n", "0 China 1439323776\n", "1 US 331002651\n", "2 Russia 145934462\n", "3 UK 67886011\n" ] } ], "source": [ "print(country_df[['Country', 'Population']])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see that these entries have numeric indices from 0 to 3. We can use text labels as indices instead, by setting one of the columns to be the index:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Population HDI\n", "Country \n", "China 1439323776 0.758\n", "US 331002651 0.920\n", "Russia 145934462 0.824\n", "UK 67886011 0.920\n" ] } ], "source": [ "# Use the 'Country' column as the index, modifying country_df in place\n", "country_df.set_index('Country', inplace=True)\n", "\n", "print(country_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This can make the data in a single column easier to read, because each value is shown next to its country name:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Country\n", "China 0.758\n", "US 0.920\n", "Russia 0.824\n", "UK 0.920\n", "Name: HDI, dtype: float64\n" ] } ], "source": [ "print(country_df['HDI'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also access all data corresponding to a single entry in the DataFrame. 
This can be done either by the entry's label using `.loc` (if text indices are being used) or by its numerical position using `.iloc`. A single entry is returned as a Series with one data type, so the integer population is displayed as a float." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Population 1.459345e+08\n", "HDI 8.240000e-01\n", "Name: Russia, dtype: float64\n" ] } ], "source": [ "# Accessing the third entry (position 2) in the DataFrame\n", "print(country_df.iloc[2])" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Population 1.439324e+09\n", "HDI 7.580000e-01\n", "Name: China, dtype: float64\n" ] } ], "source": [ "# Accessing the entry with the index 'China'\n", "print(country_df.loc['China'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Note: we will be working with pandas a lot throughout the course. If you want to learn more about any particular features of pandas, check out the [pandas documentation](https://pandas.pydata.org/docs/).*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "You can use NumPy arrays as a source of data when creating a DataFrame. Make a DataFrame with two columns, 'A' and 'B', each of which has 10 random numbers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.9" } }, "nbformat": 4, "nbformat_minor": 4 }