{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Module 3\n",
    "\n",
    "## Video 13: Pandas\n",
    "**Python for the Energy Industry**\n",
    "\n",
    "Pandas is a python module design for working with tabular data. The core data structure of pandas is the DataFrame. DataFrames share a lot in common with numpy arrays, with two main differences:\n",
    "- They can also store non-numeric data\n",
    "- The columns in a DataFrame are generally given text labels\n",
    "\n",
    "A DataFrame can be created by pandas from a dictionary, where each key is a column label, and the values are the corresponding values for each 'entry'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  Country  Population    HDI\n",
      "0   China  1439323776  0.758\n",
      "1      US   331002651  0.920\n",
      "2  Russia   145934462  0.824\n",
      "3      UK    67886011  0.920\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "country_df = pd.DataFrame({\n",
    "    'Country': ['China','US','Russia','UK'],\n",
    "    'Population': [1439323776, 331002651, 145934462, 67886011],\n",
    "    'HDI': [0.758, 0.920, 0.824, 0.920]\n",
    "})\n",
    "\n",
    "print(country_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example, there a 4 entries representing countries, and corresponding population and HDI values. A particular column can be accessed in the same way as data is accessed in a dictionary:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    1439323776\n",
      "1     331002651\n",
      "2     145934462\n",
      "3      67886011\n",
      "Name: Population, dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(country_df['Population'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or multiple columns can be accessed at once:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  Country  Population\n",
      "0   China  1439323776\n",
      "1      US   331002651\n",
      "2  Russia   145934462\n",
      "3      UK    67886011\n"
     ]
    }
   ],
   "source": [
    "print(country_df[['Country','Population']])"
   ]
  },
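  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small aside, the two kinds of selection shown above return different types of object. The sketch below assumes `country_df` is the DataFrame created earlier in this notebook: a single label should give a one-dimensional pandas Series, while a list of labels should give a DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# country_df is assumed to be the DataFrame defined earlier in this notebook\n",
    "\n",
    "# A single column label selects one column as a Series\n",
    "print(type(country_df['Population']))\n",
    "\n",
    "# A list of column labels selects those columns as a DataFrame\n",
    "print(type(country_df[['Population', 'HDI']]))"
   ]
  },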
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can see that these entries have numeric indicies from 0-3. We can also use text labels for indices instead, by setting one of the columns to be the index:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "         Population    HDI\n",
      "Country                   \n",
      "China    1439323776  0.758\n",
      "US        331002651  0.920\n",
      "Russia    145934462  0.824\n",
      "UK         67886011  0.920\n"
     ]
    }
   ],
   "source": [
    "country_df.set_index('Country',inplace=True)\n",
    "\n",
    "print(country_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This can make it a bit easier to read data from a single column:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Country\n",
      "China     0.758\n",
      "US        0.920\n",
      "Russia    0.824\n",
      "UK        0.920\n",
      "Name: HDI, dtype: float64\n"
     ]
    }
   ],
   "source": [
    "print(country_df['HDI'])"
   ]
  },
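  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Setting the index can also be undone. The sketch below assumes `country_df` currently has 'Country' as its index, as set above; the names `numbered_df` and `relabelled_df` are placeholders chosen for this illustration. Both `reset_index` and `set_index` (when called without `inplace=True`) return a new DataFrame and leave the original unchanged."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# reset_index() returns a new DataFrame in which 'Country' is an ordinary\n",
    "# column again and the rows are numbered from 0\n",
    "numbered_df = country_df.reset_index()\n",
    "print(numbered_df)\n",
    "\n",
    "# set_index() without inplace=True returns a new DataFrame as well,\n",
    "# so numbered_df itself is not modified\n",
    "relabelled_df = numbered_df.set_index('Country')\n",
    "print(relabelled_df.index)"
   ]
  },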
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also access all data corresponding to a single entry in the DataFrame. This can be done either by the entry name (if text indices are being used) or by its numerical index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Population    1.459345e+08\n",
      "HDI           8.240000e-01\n",
      "Name: Russia, dtype: float64\n"
     ]
    }
   ],
   "source": [
    "# Accessing the third entry in the DataFrame\n",
    "print(country_df.iloc[2])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Population    1.439324e+09\n",
      "HDI           7.580000e-01\n",
      "Name: China, dtype: float64\n"
     ]
    }
   ],
   "source": [
    "# Accessing the entry with the index 'China'\n",
    "print(country_df.loc['China'])"
   ]
  },
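  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the two outputs above, the Population values are printed in floating-point notation. This is because a single row is returned as a Series, which holds one data type, so the integer Population is converted to a float alongside HDI. As a sketch of one way to read an individual value without this conversion (assuming `country_df` is indexed by 'Country' as above), a row and a column can be passed to `.loc` or `.iloc` together:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Row label and column label together select a single value\n",
    "print(country_df.loc['China', 'Population'])\n",
    "\n",
    "# The positional equivalent with .iloc: first row, first column\n",
    "print(country_df.iloc[0, 0])\n",
    "\n",
    "# Passing a list of row labels selects several entries as a DataFrame\n",
    "print(country_df.loc[['US', 'UK']])"
   ]
  },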
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Note: we will be working a lot with pandas a lot throughout the course. If you want to learn more about any particular features of pandas, check out the [pandas documentation.](https://pandas.pydata.org/docs/)*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise\n",
    "\n",
    "You can use numpy arrays as a source of data when creating a DataFrame. Make a DataFrame with two columns 'A' and 'B', each of which have 10 random numbers."
   ]
  },
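  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a hint, here is a sketch of the general pattern using non-random data. The column labels 'X' and 'Y' and the name `hint_df` are placeholders for illustration only. For the exercise itself, one option for generating the random numbers is `np.random.rand`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Each numpy array becomes one column; the arrays must have the same length\n",
    "hint_df = pd.DataFrame({\n",
    "    'X': np.arange(5),\n",
    "    'Y': np.linspace(0.0, 1.0, 5)\n",
    "})\n",
    "\n",
    "print(hint_df)"
   ]
  },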
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}