{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img align=\"left\" src=\"../All-sample-files/CC_BY.png\"><br />\n",
    "\n",
    "This notebook is adapted by Zhuo Chen from the notebooks created by [Nathan Kelber](https://github.com/ithaka/tdm-notebooks/blob/e6275296c010280909e90e3ea47922d52d99c5a7/pandas-1.ipynb), [William Mattingly](https://github.com/wjbmattingly/tap-2022-pandas) and [Melanie Walsh](https://github.com/melaniewalsh/Data-Analysis-with-Pandas) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).<br />\n",
    "For questions/comments/improvements, email zhuo.chen@ithaka.org or nathan.kelber@ithaka.org.<br />\n",
    "___"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pandas Basics 1\n",
    "\n",
    "**Description:** This notebook describes how to:\n",
    "* Create a Pandas Series or a DataFrame\n",
    "* Create a dataframe from files of different formats\n",
    "* Explore the data in a dataframe\n",
    "* Access data from a dataframe\n",
    "* Set values in a dataframe\n",
    "* Create a new column based on an existing one\n",
    "\n",
    "This is the first notebook in a series on learning to use Pandas. \n",
    "\n",
    "**Use Case:** For Learners (Detailed explanation, not ideal for researchers)\n",
    "\n",
    "**Difficulty:** Beginner\n",
    "\n",
    "**Knowledge Required:** \n",
    "* Python Basics ([Start Python Basics I](../Python-basics/python-basics-1.ipynb))\n",
    "\n",
    "**Knowledge Recommended:** \n",
    "* [Python Intermediate 2](../Python-intermediate/python-intermediate-2.ipynb)\n",
    "* [Python Intermediate 4](../Python-intermediate/python-intermediate-4.ipynb)\n",
    "\n",
    "**Completion Time:** 90 minutes\n",
    "\n",
    "**Data Format:** .csv, .xsxl\n",
    "\n",
    "**Libraries Used:** Pandas\n",
    "\n",
    "**Research Pipeline:** None\n",
    "___"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "<center><img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Pandas_logo.svg/1200px-Pandas_logo.svg.png\" width=\"500\"></center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pandas is a Python library that allows you to easily work with tabular data. Most people are familiar with commercial spreadsheet software, such as Microsoft Excel or Google Sheets. While spreadsheet software and Pandas can accomplish similar tasks, each has significant advantages depending on the use-case.\n",
    "\n",
    "**Advantages of Spreadsheet Software**\n",
    "* Point and click\n",
    "* Easier to learn\n",
    "* Great for small datasets (<10,000 rows)\n",
    "* Better for browsing data\n",
    "\n",
    "**Advantages of Pandas**\n",
    "* More powerful data manipulation with Python\n",
    "* Can work with large datasets (millions of rows)\n",
    "* Faster for complicated manipulations\n",
    "* Better for cleaning and/or pre-processing data\n",
    "* Can automate workflows in a larger data pipeline\n",
    "\n",
    "In short, spreadsheet software is better for browsing small datasets and making moderate adjustments. Pandas is better for automating data cleaning processes that require large or complex data manipulation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import subprocess\n",
    "import sys\n",
    "\n",
    "def install(package):\n",
    "    subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", package])\n",
    "\n",
    "# Install your specific packages\n",
    "packages = [\n",
    "    'beautifulsoup4==4.12.2',\n",
    "    'click==8.1.3',\n",
    "    'gensim==4.3.1',\n",
    "    'ipympl==0.9.3',\n",
    "    'jupyter-ai==2.19.1',\n",
    "    'jupyter-ai-magics==2.19.0',\n",
    "    'jupyterlab-git==0.50.0',\n",
    "    'matplotlib==3.8.4',\n",
    "    'numpy>=1.16',\n",
    "    'nltk==3.9.1',\n",
    "    'openai==1.51.0',\n",
    "    'pandas>=2.0.3',\n",
    "    'pillow==10.3.0',\n",
    "    'pyarrow==14.0.1',\n",
    "    'pyldavis==3.4.1',\n",
    "    'pytesseract==0.3.10',\n",
    "    'regex==2023.6.3',\n",
    "    'requests==2.32.3',\n",
    "    'scikit-learn==1.5.1',\n",
    "    'scipy==1.11.1',\n",
    "    'seaborn==0.12.2',\n",
    "    'spacy==3.5.4',\n",
    "    'urllib3==2.2.2',\n",
    "    'vadersentiment==3.3.2',\n",
    "    'wordcloud==1.9.2',\n",
    "    'zipp==3.19.2',\n",
    "    'openpyxl'\n",
    "]\n",
    "\n",
    "for package in packages:\n",
    "    install(package)\n",
    "\n",
    "# Additional setup for specific packages that need extra data/models\n",
    "print(\"Setting up additional package data...\")\n",
    "\n",
    "# NLTK data downloads\n",
    "try:\n",
    "    import nltk\n",
    "    nltk.download('punkt', quiet=True)\n",
    "    nltk.download('stopwords', quiet=True)\n",
    "    nltk.download('vader_lexicon', quiet=True)\n",
    "    nltk.download('wordnet', quiet=True)\n",
    "    nltk.download('omw-1.4', quiet=True)\n",
    "    print(\"✓ NLTK data downloaded\")\n",
    "except Exception as e:\n",
    "    print(f\"⚠ NLTK setup issue: {e}\")\n",
    "\n",
    "print(\"Package installation and setup complete!\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# import pandas, `as pd` allows us to shorten typing `pandas` to `pd` when we call pandas\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pandas Series and Pandas DataFrame\n",
    "In Pandas, data are stored in two fundamental objects: \n",
    "\n",
    "* **Pandas Series** - a single column or row of data\n",
    "* **Pandas DataFrame** - a table of data containing multiple columns and rows\n",
    "\n",
    "### Pandas Series\n",
    "\n",
    "We can think of a Series as a single column or row of data. Here we have a column called `Champions` with the country names of the winners of the most recent ten FIFA world cup games.\n",
    "\n",
    "|Champions|\n",
    "|---|\n",
    "|Argentina|\n",
    "|France|\n",
    "|Germany|\n",
    "|Spain|\n",
    "|Italy|\n",
    "|Brazil|\n",
    "|France|\n",
    "|Brazil|\n",
    "|Germany|\n",
    "|Argentina|\n",
    "\n",
    "Let's create a Series based on this column. To create our Series, we pass a **list** into the `.Series()` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Create a data series object in Pandas\n",
    "champions = pd.Series([\"Argentina\",\n",
    "                       \"France\", \n",
    "                       \"Germany\", \n",
    "                       \"Spain\", \n",
    "                       \"Italy\", \n",
    "                       \"Brazil\", \n",
    "                       \"France\", \n",
    "                       \"Brazil\", \n",
    "                       \"Germany\",\n",
    "                       \"Argentina\"]\n",
    "                    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Take a look at the Series\n",
    "champions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, except the data column, we also have an index column. By default, the indexes are numbers starting from 0. We could define the indexes ourselves. To do that, we will pass a **dictionary** to the `.Series()` method. The keys of the dictionary will be used as indexes. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Use self-defined indexes\n",
    "pd.Series({2022: \"Argentina\", \n",
    "           2018: \"France\", \n",
    "           2014: \"Germany\", \n",
    "           2010: \"Spain\", \n",
    "           2006: \"Italy\", \n",
    "           2002: \"Brazil\", \n",
    "           1998: \"France\", \n",
    "           1994: \"Brazil\", \n",
    "           1990: \"Germany\",\n",
    "           1986: \"Argentina\"}\n",
    "         )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can give a name to your Pandas Series using the `name` parameter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# give a name to the series\n",
    "pd.Series({2022: \"Argentina\", \n",
    "           2018: \"France\", \n",
    "           2014: \"Germany\", \n",
    "           2010: \"Spain\", \n",
    "           2006: \"Italy\", \n",
    "           2002: \"Brazil\", \n",
    "           1998: \"France\", \n",
    "           1994: \"Brazil\", \n",
    "           1990: \"Germany\",\n",
    "           1986: \"Argentina\"}, \n",
    "          name = 'World Cup Champions'\n",
    "         )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pandas DataFrame\n",
    "\n",
    "While a Pandas Series is 1-dimensional with a single column/row of data, a Pandas DataFrame is 2-dimensional and can have multiple columns and rows. \n",
    "\n",
    "|Year|Champion|Host|\n",
    "|---|---|---|\n",
    "|2022|Argentina|Qatar|\n",
    "|2018|France|Russia|\n",
    "|2014|Germany|Brazil|\n",
    "|2010|Spain|South Africa|\n",
    "|2006|Italy|Germany|\n",
    "|2002|Brazil|Korea/Japan|\n",
    "|1998|France|France|\n",
    "|1994|Brazil|USA|\n",
    "|1990|Germany|Italy|\n",
    "|1986|Argentina|Mexico|\n",
    "\n",
    "Let's create a Pandas DataFrame based on this table. To create our dataframe, we pass a **dictionary** into the `DataFrame()` method. Each `key:value` pair will form a column in the dataframe, with the key as the column name and the value as the data in that column. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Create a Pandas dataframe\n",
    "wcup = pd.DataFrame({\"Year\": [2022, \n",
    "                              2018, \n",
    "                              2014, \n",
    "                              2010, \n",
    "                              2006, \n",
    "                              2002, \n",
    "                              1998, \n",
    "                              1994, \n",
    "                              1990,\n",
    "                              1986], \n",
    "                     \"Champion\": [\"Argentina\", \n",
    "                                  \"France\", \n",
    "                                  \"Germany\", \n",
    "                                  \"Spain\", \n",
    "                                  \"Italy\", \n",
    "                                  \"Brazil\", \n",
    "                                  \"France\", \n",
    "                                  \"Brazil\", \n",
    "                                  \"Germany\", \n",
    "                                  \"Argentina\"], \n",
    "                     \"Host\": [\"Qatar\", \n",
    "                              \"Russia\", \n",
    "                              \"Brazil\", \n",
    "                              \"South Africa\", \n",
    "                              \"Germany\", \n",
    "                              \"Korea/Japan\", \n",
    "                              \"France\", \n",
    "                              \"USA\", \n",
    "                              \"Italy\", \n",
    "                              \"Mexico\"]\n",
    "                    })\n",
    "\n",
    "wcup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In a Pandas dataframe, each row/column is technically a Pandas Series. We can see this by selecting the first row with the `iloc` method and check its type. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Get the type of a row in a dataframe\n",
    "type(wcup.iloc[0]) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's select a column and check its type. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Get the type of a column in a dataframe\n",
    "type(wcup['Champion'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will describe row/column selection in greater detail below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3 style=\"color:red; display:inline\">Coding Challenge! &lt; / &gt; </h3>\n",
    "\n",
    "You are a middle school teacher. You teach the Butterfly Class and the Hippo Class. Last week, the Butterfly class had an English test and a math test. You would like to make a dataframe to record the English grades and math grades of the students in the Butterfly Class. \n",
    "\n",
    "Make a dataframe with three columns: name, English and Math."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# make a dataframe\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Explore the data\n",
    "\n",
    "After we build a dataframe, it is helpful to get a general idea of the data. The first step is to explore the dataframe's attributes. Attributes are properties of the dataframe (not functions), so they do not have parentheses `()` after them. \n",
    "\n",
    "|Attribute|Reveals|\n",
    "|---|---|\n",
    "|.shape| The number of rows and columns|\n",
    "|.columns| The name of each column|\n",
    "\n",
    "\n",
    "To get how many rows and columns a dataframe has, we use the `.shape` attribute. `df.shape` returns a tuple with (number of rows, number of columns)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# df.shape returns a tuple (# of rows, # of columns)\n",
    "wcup.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Use `.columns` attribute to find the column names\n",
    "wcup.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are some methods we can use to explore the data as well. \n",
    "\n",
    "\n",
    "|Method|Reveals|\n",
    "|---|---|\n",
    "|.info( )| Column count and data type|\n",
    "|.head( )| First five rows|\n",
    "|.tail( )|Last five rows|"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Use `.info()` to get column count and data type\n",
    "wcup.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can get a preview of the dataframe. The `.head()` and `.tail()` methods help us do that."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Display the first five rows of the data\n",
    "wcup.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Display the last five rows of the data\n",
    "wcup.tail()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Specify the number of rows at the beginning of the table to display\n",
    "wcup.head(8)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read and write tabular data in Pandas\n",
    "\n",
    "Pandas provides different methods to read and write tabular data. The methods used to read in data from files are `.read_*()`. The methods used to write data into files are `.to_*()`.\n",
    "\n",
    "For example, we can create a dataframe from a csv file using the `.read_csv()` method. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "### Create the data folder\n",
    "from pathlib import Path\n",
    "import urllib.request\n",
    "\n",
    "# Check if a data folder exists. If not, create it.\n",
    "data_folder = Path('./data/')\n",
    "data_folder.mkdir(exist_ok=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Use the read_csv() method to create a dataframe\n",
    "failed_banks = pd.read_csv('../All-sample-files/Pandas1_failed_banks_since_2000.csv')\n",
    "failed_banks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also write the tabular data from a dataframe into a file. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# write the dataframe we created for the world cup champions into a file\n",
    "wcup.to_csv('./data/wcup_champions.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pandas can read data from files of many different formats and write data into files of many different formats. \n",
    "<center><img src=\"../All-sample-files/Pandas1_DiffFormats.png\" width=\"700\"></center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Suppose you would like to read the data of COVID-19 cases in Massachusetts into a dataframe. Can you use a Pandas method to do that? "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "file = '../All-sample-files/Pandas1_covid_MA_06292023.xlsx'\n",
    "\n",
    "# Read in the data\n",
    "covid = pd.read_excel(file)\n",
    "covid"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, can you write the dataframe you created into a json file?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# write the dataframe into a json file\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Access data\n",
    "In this section, we will take a look at the different ways of accessing the data in a dataframe. \n",
    "\n",
    "For example, once you get the column names, you could access a column of your interest. You can use the bracket notation `df[ColumnName]` to get a specific column. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Use bracket notation to access the column 'Champion'\n",
    "wcup['Champion']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also access multiple columns from a dataframe by putting the column names in a list. Note that in this case, you have two layers of hard brackets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Access multiple columns\n",
    "wcup[['Year','Champion']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The indexers `.iloc` and `.loc`\n",
    "In Pandas, there are two indexers `.iloc` and `.loc` that are often used to access data in a dataframe. \n",
    "#### .iloc\n",
    "`.iloc` allows us to access a row or a column using its *integer location*.\n",
    "\n",
    "Recall that in a dataframe, by default, to the left of each row are index numbers. The index numbers are similar to the index numbers for a Python list; they help us reference a particular row for data retrieval. Also, like a Python list, the index begins with 0. \n",
    "\n",
    "We can retrieve data using the `.iloc` attribute. The syntax of `.iloc` indexer is `df.iloc[row selection, column selection]`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Access a single row\n",
    "wcup.iloc[5] # Access the row with the index number 5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When we select multiple consecutive rows from a dataframe, we give a starting index and an ending index. Notice that the selected rows will **not** include the final index row. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Access multiple consecutive rows\n",
    "wcup.iloc[2:5] # Access the rows with the index number 2, 3, and 4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Access multiple non-consecutive rows\n",
    "wcup.iloc[[0,2,5]] # Access the rows with the index number 0, 2, and 5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# access every other row in wcup\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have seen how we can access rows from a dataframe using the `.iloc` indexer. In the following, we will use the `.iloc` indexer to access columns. Recall that the syntax of `.iloc` indexer is `df.iloc[row selection, column selection]`. Again, the index numbers for the columns are similar to the index numbers for a Python list; they help us reference a particular column for data retrieval. Also, like a Python list, the index begins with 0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Access a single column\n",
    "wcup.iloc[:,1] # Access the column with the index number 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that we cannot use the column name, i.e., the header, to access a column because `.iloc` accesses data using their *integer location*. If we try to access a column using its column name, we get an error!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# .iloc cannot access a column by its name\n",
    "wcup.iloc[:,'Champion']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use integer slice to access multiple columns from a dataframe. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Access multiple consecutive columns \n",
    "wcup.iloc[:,1:3] # Access the second and third column of the dataframe wcup "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Access multiple non-consecutive columns\n",
    "wcup.iloc[:,[0,2]] # Access the first and third column of the dataframe wcup "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that you know how to select rows and columns from a dataframe using `.iloc`. You should be able to figure out how to get a slice of a dataframe using `.iloc`. For example, if you would like to know the champion of the world cup games between 1994 and 2010. How do you slice the dataframe `wcup` to get the part you are interested in?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Slice the dataframe using .iloc[ ]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### .loc\n",
    "While `.iloc` is integer-based, `.loc` is label-based. It means that you have to access rows and columns based on their row and column labels.\n",
    "\n",
    "The syntax of `.loc` is `df.loc[row selection, column selection]`.\n",
    "\n",
    "At the moment, the labels for the rows are just their index numbers. When we use `.loc` to access a row, it will look very similar to what we did with `.iloc`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Access a row using .loc\n",
    "wcup.loc[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But we could make our index column customized. For example, we could use the column `Year` as the index column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Set the column 'Year' as the index column\n",
    "wcup = wcup.set_index('Year')\n",
    "wcup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After we make the change, we will use the new labels to access the rows. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Access a row using .loc\n",
    "wcup.loc[2006]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Access multiple consecutive rows\n",
    "wcup.loc[2018:2010] "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that with the label search, the ending index row is included."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Access multiple non-consecutive rows\n",
    "wcup.loc[[1994, 2002, 2010]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Access a column\n",
    "wcup.loc[:, 'Host']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that you know how to select rows and columns from a dataframe using `.loc[ ]`. You should be able to figure out how to get a slice of a dataframe using `.loc[ ]`. For example, if you would like to know the champion of the world cup games between 1994 and 2010. How do you slice the dataframe `wcup` to get the part you are interested in?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Slice the dataframe using .loc[ ]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**As a quick reminder**, remember that `.iloc[]` slicing is not inclusive of the final value. On the other hand, `.loc[]` slicing does include the final value. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The indexers `.iat` and `.at`\n",
    "\n",
    "We have learned how to use the two indexers `.iloc` and `.loc` to access rows and columns from a dataframe. In real life, sometimes we only want to access the value in a single cell. In this case, the fastest way is to use the `.iat` and `at` indexers. We can now tell from the name that `iat` provides **integer**-based lookups while `at` provides **label**-based lookups. \n",
    "\n",
    "Suppose we would like to get the champion country of the 2002 world cup. How do you do that?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Get the champion country of the 2002 world cup using .at[]\n",
    "wcup.at[2002, 'Champion']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Get the champion country of the 2002 world cup using .iat[]\n",
    "wcup.iat[5, 0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3 style=\"color:red; display:inline\">Coding Challenge! &lt; / &gt; </h3>\n",
    " \n",
    "You are a middle school teacher. You have a .csv file that stores the English grades and Math grades of the students in your Butterfly class. Can you use a `.read_*()` method to read in the data from the file and create a dataframe? After that, can you use `.iloc[]` or `.loc[]` to get the Math grades of the first three students from the dataframe? \n",
    "After you get the slice of the dataframe with the Math grades of the first three students, can you write the data into a file of `.xlsx` format?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Get the csv file\n",
    "grades_file = '../All-sample-files/Pandas1_grades.csv'\n",
    "\n",
    "# Read in the data and create a dataframe\n",
    "\n",
    "\n",
    "# Select the Math grades for the first three students\n",
    "\n",
    "\n",
    "# Write the slice into a file of .xlsx format\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set values\n",
    "\n",
    "Now that we know how to access data from a dataframe, we could easily use what we have learned to set the values in a dataframe. To do that, we will use assignment statements you have learned in Python Basics. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Get the data\n",
    "grades = pd.read_csv('../All-sample-files/Pandas1_grades.csv')\n",
    "grades"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Use the indexers to set values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Recall that we have learned how to use the indexers to access a slice of a dataframe. We can use the same indexers to change the values in a dataframe. \n",
    "\n",
    "For example, we can get the English grades in the first two rows and change the values to 80. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Get the English grades in the first two rows and change to 80\n",
    "grades.iloc[:2, 1] = 80"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-08T19:15:53.250163Z",
     "iopub.status.busy": "2023-07-08T19:15:53.249772Z",
     "iopub.status.idle": "2023-07-08T19:15:53.269684Z",
     "shell.execute_reply": "2023-07-08T19:15:53.268858Z",
     "shell.execute_reply.started": "2023-07-08T19:15:53.250127Z"
    }
   },
   "source": [
    "Can you use `loc` to change the math grades in the last two rows to 90?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Change the math grades in the last two rows to 90\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And of course, we can get a single value from a dataframe and change it using `iat` or `at`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get Jane White's math grade and change it to 85\n",
    "grades.at[3, 'Math'] = 85"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-08T19:19:26.009829Z",
     "iopub.status.busy": "2023-07-08T19:19:26.009123Z",
     "iopub.status.idle": "2023-07-08T19:19:26.026332Z",
     "shell.execute_reply": "2023-07-08T19:19:26.025528Z",
     "shell.execute_reply.started": "2023-07-08T19:19:26.009765Z"
    }
   },
   "source": [
    "It's your turn. Can you get Eve Lynn's English grade and change it to 82? This time, however, use `iat`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get Eve Lynn's math grade and change it to 85\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a new column based on an existing one"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's get the world cup dataframe again. We add a new column Score to store the scores of the games. Can you set the values in this column to the goals that were scored by the champion in the games?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Add a new column of score\n",
    "score = [\"7-5\", ### put the data in a list\n",
    "         \"4-2\", \n",
    "         \"1-0\", \n",
    "         \"1-0\", \n",
    "         \"6-4\", \n",
    "         \"2-0\", \n",
    "         \"3-0\", \n",
    "         \"3-2\", \n",
    "         \"1-0\", \n",
    "         \"3-2\"]\n",
    "wcup['Score'] = score # make a new column of score\n",
    "wcup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-08T19:39:28.512789Z",
     "iopub.status.busy": "2023-07-08T19:39:28.511726Z",
     "iopub.status.idle": "2023-07-08T19:39:28.531325Z",
     "shell.execute_reply": "2023-07-08T19:39:28.528819Z",
     "shell.execute_reply.started": "2023-07-08T19:39:28.512718Z"
    }
   },
   "source": [
    "In soccer games, it is common to calculate the goals scored and goals conceded by the champion in the final. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create a new column 'Goals Scored'\n",
    "wcup['Goals Scored'] = wcup['Score'].str[0]\n",
    "wcup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# create a new column 'Goals Conceded'\n",
    "wcup['Goals Conceded'] = wcup['Score'].str[-1]\n",
    "wcup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the info on goals scored and conceded by the champion, we can create a column containing the difference between the two. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a new column 'Difference'\n",
    "wcup['Difference'] = wcup['Goals Scored'] - wcup['Goals Conceded']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We get an error! Why? The error message gives us a hint. The minus operator is not defined for the data type str! \n",
    "\n",
    "Luckily, Pandas has a convenient method that allows us to convert data types. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Create a new column 'Difference'\n",
    "wcup['Difference'] = wcup['Goals Scored'].astype(int) -  wcup['Goals Conceded'].astype(int)\n",
    "wcup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3 style=\"color:red; display:inline\">Coding Challenge! &lt; / &gt; </h3>\n",
    "\n",
    "Can you create two new columns in the grades dataframe, one with the students' first names, one with their last names? "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# take a look at the grades df\n",
    "grades"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "___\n",
    "## Lesson Complete\n",
    "\n",
    "Congratulations! You have completed *Pandas Basics 1*.\n",
    "\n",
    "### Start Next Lesson: [Pandas 2 ->](./pandas-basics-2.ipynb)\n",
    "\n",
    "### Exercise Solutions\n",
    "Here are a few solutions for exercises in this lesson."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Make a dataframe to record English and Math grades of the Butterfly class\n",
    "butterfly = pd.DataFrame({\"Name\": ['John Smith', \n",
    "                              'Alex Hazel', \n",
    "                              'Beatrice Dean', \n",
    "                              'Jane White', \n",
    "                              'Eve Lynn'],                        \n",
    "                         \"English\": [78,\n",
    "                                    80,\n",
    "                                    72,\n",
    "                                    75,\n",
    "                                    73],\n",
    "                         \"Math\": [80,\n",
    "                                 75,\n",
    "                                 95,\n",
    "                                 70,\n",
    "                                 82]\n",
    "                         })\n",
    "butterfly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### Get the math grades of the first three students and write the data into an excel file\n",
    "\n",
    "# Get the csv file\n",
    "grades_file = '../All-sample-files/Pandas1_grades.csv'\n",
    "\n",
    "# Read in the data and create a dataframe\n",
    "butterfly = pd.read_csv(grades_file)\n",
    "\n",
    "# Select the Math grades for the first three students\n",
    "butterfly_slice = butterfly.loc[:3, ['Math']]\n",
    "\n",
    "# Write the slice into a file of .xlsx format\n",
    "butterfly_slice.to_excel('./data/butterfly_slice.xlsx')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Can you create two new columns in the grades dataframe \n",
    "# one with the students' first names, one with their last names\n",
    "grades['First Name'] = grades['Name'].str.split().str[0]\n",
    "grades['Last Name'] = grades['Name'].str.split().str[1]\n",
    "grades"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": true,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "calc(100% - 180px)",
    "left": "10px",
    "top": "150px",
    "width": "278px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}