{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "4a87b5ef",
   "metadata": {},
   "source": [
    "---   \n",
    " <img align=\"left\" width=\"75\" height=\"75\"  src=\"https://upload.wikimedia.org/wikipedia/en/c/c8/University_of_the_Punjab_logo.png\"> \n",
    "\n",
    "<h1 align=\"center\">Department of Data Science</h1>\n",
    "<h1 align=\"center\">Course: Tools and Techniques for Data Science</h1>\n",
    "\n",
    "---\n",
    "<h3><div align=\"right\">Instructor: Muhammad Arif Butt, Ph.D.</div></h3>    "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ab0dc25c",
   "metadata": {},
   "source": [
    "<h1 align=\"center\">Lecture 3.16 (Pandas-08)</h1>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2aa8cebf",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/arifpucit/data-science/blob/master/Section-3-Python-for-Data-Scientists/Lec-3.16(Pandas-08-Handling-Missing-Data).ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19f82705",
   "metadata": {},
   "source": [
    "<img align=\"right\" width=\"400\" height=\"400\"  src=\"images/pandas-apps.png\"  >\n",
    "\n",
    "## _Handling Missing Data.ipynb_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7fa27296",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "12db95e1",
   "metadata": {},
   "source": [
    "## Learning agenda of this notebook\n",
    "\n",
    "1. Have an insight about the Dataset\n",
    "2. Identify the Columns having Null/Missing values using `df.isna()` method\n",
    "3. Handle/Impute the Null/Missing Values under the `math` Column using `df.loc[mask,col]=value`\n",
    "4. Handle/Impute the Null/Missing Values under the `group` Column using `df.loc[mask,col]=value`\n",
    "5. Handle Missing values under a Numeric/Categorical Column using `fillna()`\n",
    "6. Handle Repeating Values (for same information) under the `session` Column\n",
    "7. Create a new Column by Modifying an Existing Column\n",
    "8. Delete Rows Having NaN values using `df.dropna()` method\n",
    "9. Convert Categorical Variables into Numerical"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0024e3de",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c741f51b",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d804305e",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f14bdc3",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "1d2bba7a",
   "metadata": {},
   "source": [
    "## 1. Have an Insight about the Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3c18ecd6",
   "metadata": {},
   "outputs": [],
   "source": [
    "! cat datasets/group-marks.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7144b3e0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# import the pandas library\n",
    "import pandas as pd\n",
    "df = pd.read_csv('datasets/group-marks.csv')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cf6ed7b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8653443c",
   "metadata": {},
   "source": [
    "- Whenever the **`pd.read.csv()`** method detects a missing value (nothing between two commas in a csv file or an empty cell in Excel) it flags it with NaN. There can be many reasons for these NaN values, one can be that the data is gathered via google form from people and this field might be optional and skipped.\n",
    "- There can also be a scenario that a user has entered some text under a numeric field about which he/she do not have any information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f37c0e6d",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "078444a0",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6e85228e",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dd80dcff",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c20c7fd0",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4dab5d98",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d9e05576",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "e2505bd5",
   "metadata": {},
   "source": [
    "## 2. Identify the Columns having Null/Missing values\n",
    "- The **`df.isna()`** method is recommended to use than `df.isnull()`, which return a boolean same-sized object that indicates whether an element is NA value or not. Missing values get mapped to True. Everything else gets mapped to False values. Remember, characters such as empty strings ``''`` or `numpy.inf` are not considered NA values.\n",
    "- The **`df.notna()`** method is recommended to use than `df.notnull()` methods return a boolean same-sized object that indicates whether an element is NA value or not. Non-missing values get mapped to True. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c4f6a75c",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.math.isna()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "406d9230",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d7d48904",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f1776b52",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.loc[df.math.isna(), :]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aa2ed31c",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.isna().head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "af437830",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.notna().head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "25b7c141",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Now we can use sum() on this dataframe object of Boolean values (True is mapped to 1)\n",
    "df.isna().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "966c78ce",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Similarly, we can use sum() on this dataframe object of Boolean values (True is is mapped to 1)\n",
    "df.notna().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4c595a48",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7202e219",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1cea055f",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "45e02d50",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "14dd039f",
   "metadata": {},
   "source": [
    "## 3. Handle/Impute the Null/Missing Values under the `math` Column"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "190177e8",
   "metadata": {},
   "source": [
    "### a. Identify the Rows under the `math` Column having Null/Missing values\n",
    "- The `df.isna()` method works equally good on Series objects as well"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b0853c04",
   "metadata": {},
   "outputs": [],
   "source": [
    "mask = df.math.isna()\n",
    "mask"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "88c4e19e",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.loc[df.math.isna(), :]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "72a967c6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# This will return only those rows of dataframe having null values under the math column\n",
    "df[mask]         # df[df.math.isna()]\n",
    "df.loc[mask, :]  # df.loc[df.math.isna(), :]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e49c2c30",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d48768b4",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7bf12448",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f5d527b7",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "552c7ce9",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "430a2bed",
   "metadata": {},
   "source": [
    "### b. Replace the Null/Missing Values under the `math` Column\n",
    "- After detecting the NaN values, the next question is, what value we should write in the cells where we have Null/Missing values under the `math` column\n",
    "- Suppose, we want to put the average values at the place of missing values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fec9c129",
   "metadata": {},
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7c8046ed",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compute the mean of math column\n",
    "df.math.mean() "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7a08ec59",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f9254c93",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5940b60b",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "ae5eca68",
   "metadata": {},
   "source": [
    "> By seeing the error, it appears that the `math` column do not have the `int64` or `float64` type. Let us check this out"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d53fb2ad",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "67e345df",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check out the data type of math column\n",
    "df['math'].dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7f1898e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# We can also use the `df.info()` method to display the count of Non-Null columns, their datatypes, their names \n",
    "# and memory usage of that dataframe.\n",
    "\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e6affc95",
   "metadata": {},
   "source": [
    "- **What can be the reason for this?**\n",
    "- Let us check out the values under this column"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "739483d1",
   "metadata": {},
   "outputs": [],
   "source": [
    "df['math']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ce23aca2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# We can replace all such values using the `replace()` method\n",
    "import numpy as np\n",
    "df['math'] = df.math.replace('No Idea', np.nan).head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9bef252d",
   "metadata": {},
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1bc13f55",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.math.replace('No Idea', np.nan, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1b6e438c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Note the marks of Saadia in math are changed from string `No Idea` to `NaN`\n",
    "# Since this seems working fine let us make inplace=True to make these changes in the original dataframe\n",
    "df.replace('No Idea', np.nan, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a3a4ac7",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3c17ec0c",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "af6fb4d9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let us check the data type of math column\n",
    "df['math'].dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "67ec8d4e",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "35ee44f3",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c8008ee0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# It is still Object, which is natural, however, we can change the datatype to `df.astype()` method\n",
    "df['math'] = df['math'].astype(float)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0ffeba3d",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1c16b303",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6bf25352",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let us check the data type of math column\n",
    "df['math'].dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "39cb34ae",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "de2510bd",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "315e10e5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let us compute the average of math marks again \n",
    "df.math.mean() "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "19b5ba1d",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7df22e86",
   "metadata": {},
   "outputs": [],
   "source": [
    "# List only those records under math column having Null values\n",
    "mask = df.math.isna()\n",
    "df.loc[mask, 'math']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "849e288e",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.math.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c21e0ffd",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "01860b3a",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "64df7693",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let us replace these values with mean value of the math column\n",
    "df.loc[(df.math.isna()),'math'] = df.math.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "de8abade",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Confirm the result\n",
    "df.isna().sum()\n",
    "#df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d998cc44",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2232212e",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "87e56a44",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "abd8d783",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "80408f68",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e3819ac9",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5ac35433",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "70ac58bc",
   "metadata": {},
   "source": [
    "## 4. Handle/Impute the Null/Missing Values under the `group` Column\n",
    "- The `group` column contains categorical values, i.e., a value that can take on one of a limited, and usually fixed, number of possible values."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d9ebb9ae",
   "metadata": {},
   "source": [
    "### a. Identify the Rows under the `group` Column having Null/Missing values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0bf14128",
   "metadata": {},
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5e93928",
   "metadata": {},
   "outputs": [],
   "source": [
    "mask = df.group.isna()\n",
    "mask"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f7438404",
   "metadata": {},
   "outputs": [],
   "source": [
    "df[mask]          # df[df.group.isna()]\n",
    "df.loc[mask, :]   # df.loc[df.group.isna()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7fbb9864",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d78b2e0f",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6cbf038d",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5b757b2f",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "570d6f9e",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13ab25aa",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "f638dd05",
   "metadata": {},
   "source": [
    "### b. Replace the Null/Missing Values under the `group` Column\n",
    "- After detecting the NaN values, the next question is, what value we should write in the cells where we have Null/Missing values\n",
    "- Since this is a categorical column having datatype object (group A, group B, group C, ...), so let us replace it with th value inside the column having the maximum frequency"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5bfb3a36",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use value_counts() function which return a Series containing counts of unique values (in descending order)\n",
    "# with the most frequently-occurring element at first. It excludes NA values by default.\n",
    "df.group.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2c21ca5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Another way of doing is use the mode() function on the column\n",
    "df.group.mode() "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "66acc161",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b65a6998",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cd40e2f0",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e2f362b6",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b1c578a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# List only those records under group column having Null values\n",
    "mask = df.group.isna()\n",
    "df.loc[mask, 'group']     # df.loc[(df.group.isna()), 'group']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6396dd02",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.loc[(df.group.isna()),'group']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2758fefd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let us replace these values with maximum occurring value in the `group` column\n",
    "df.loc[(df.group.isna()),'group'] = 'group C'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4ef7df81",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Confirm the result\n",
    "df.isna().sum()\n",
    "#df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "69298cda",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "65193323",
   "metadata": {},
   "source": [
    ">Note that in the original dataframe Arifa group information was missing, and now it is `group C` "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "04fc667a",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "348ea0cb",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fcf6cda5",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5cda8f0e",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b20a19ef",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7c954256",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "406a2755",
   "metadata": {},
   "source": [
    "## 5. Handle Missing values under a Numeric/Categorical Column using `fillna()`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "697b16b8",
   "metadata": {},
   "source": [
    "### a. Replace the Null/Missing Values under the math Column using `fillna()`\n",
    "- This is more recommended way of filling in the Null values within columns of your dataset rather than the use of the `loc` method.\n",
    "```\n",
    "object.fillna(value, method, inplace=True)\n",
    "```\n",
    "- The only required argument is either the `value`, with which we want to replace the missing values OR the `method` to be used to replace the missing values\n",
    "- Returns object with missing values filled or None if ``inplace=True``"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fe077e43",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let us read the dataset again with NA values under math column\n",
    "import pandas as pd\n",
    "df = pd.read_csv('datasets/group-marks.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f5e4bb09",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "74f2206c",
   "metadata": {},
   "source": [
    ">- Before proceeding, let us this time handle the string value `No Idea` under the math column while reading the csv file, instead of doing afterwards in the dataframe using the `replace()` method as we have done above.\n",
    ">- For this we will use the `na_values` argument to the `pd.read_csv()` method, to which you can pass a single value or a list of values to be replaced with NaN"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1b02a535",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv('datasets/group-marks.csv', na_values='No Idea')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cc3fd29e",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e40e34c5",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.isna().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f2de4792",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.loc[df.math.isna()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b00977e2",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ed01e7a0",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8007a41f",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dd8541f8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# This time instead of loc, use fillna() method with just two arguments\n",
    "# inplace=True parameter ensure that this happens in the original dataframe\n",
    "\n",
    "df.math.fillna(value=df.math.mean(), inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "60298853",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Confirm the result\n",
    "df.isna().sum()\n",
    "#df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "daeac35d",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d1e84e0d",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c54c3d39",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0e5204bd",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e0d6f5d9",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "926ec32f",
   "metadata": {},
   "source": [
    "### b. Replace the Null/Missing Values under the `group` Column using `fillna()`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ad051b97",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let us read the dataset again with NA values\n",
    "import pandas as pd\n",
    "df = pd.read_csv('datasets/group-marks.csv', na_values='No Idea')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d634b9d3",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.isna().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "826d9d2d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Once again instead of loc,let us use fillna() method with just two arguments\n",
    "\n",
    "df.group.fillna('group C', inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b70222f2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Confirm the result\n",
    "df.isna().sum()\n",
    "#df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e803f22b",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d032cb3",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "06b34c06",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "04030de2",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0de0de66",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "58bf2146",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let us fill the math, english and scholarship columns as well again\n",
    "df.math.fillna(df.math.mean(), inplace=True)\n",
    "df.english.fillna(df.english.mean(), inplace=True)\n",
    "df.scholarship.fillna(df.scholarship.mean(), inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9c214fda",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Confirm the result\n",
    "df.isna().sum()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2f66a04e",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ab2d2d48",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ef5478ef",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ca660834",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "729361a1",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "5e88d9dc",
   "metadata": {},
   "source": [
    "### c. Replace the Null/Missing Values under the` math` and `group` Column using `ffill` and `bfill` Arguments\n",
    "- In above examples, we have used the mean value in case of numeric column and mode value in case of a categorical column as the filling value to the `fillna()` method\n",
    "```\n",
    "object.fillna(value, method, inplace=True)\n",
    "```\n",
    "\n",
    "- We can pass `ffill` or `bfill` as method argument to the `ffillna()` method. This will replace the null values with other values from the DataFrame\n",
    "- `ffill` (Forward fill): It fills the NaN value with the previous value\n",
    "- `bfill` (Back fill): It fills the NaN value with the Next/Upcoming value\n",
    "\n",
    "<img align=\"right\" width=\"490\" height=\"100\"  src=\"images/bfill.png\"  >\n",
    "<img align=\"left\" width=\"490\" height=\"100\"  src=\"images/ffill.png\"  >"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c42ec42e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let us read the dataset again with NA values\n",
    "import pandas as pd\n",
    "df = pd.read_csv('datasets/group-marks.csv', na_values='No Idea')\n",
    "df.head(20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1a2edccc",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.isna().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1aa32a02",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5f4d161",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "edab7c0d",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6967c27d",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1b0748ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "# forward fill or ffill attribute\n",
    "# If have NaN value, just carry forward the previous value\n",
    "# using ffill attribute, you can fill the NaN value with the previous value in that column\n",
    "df.fillna(method = 'ffill', inplace=True)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "29d96cac",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dc337636",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c0558882",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "d49023d8",
   "metadata": {},
   "source": [
    ">Is it working fine?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1a4a6277",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f4efd6f2",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "84d8cd2d",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3b085425",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.fillna(method = 'bfill', inplace=True)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dca48c0f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Confirm the result\n",
    "df.isna().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "26d21ede",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "521e390e",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "788f3da7",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "89d68fa4",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6cf5e126",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "77a90898",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0744d3ec",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aaa38f45",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d55c6345",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "849f595d",
   "metadata": {},
   "source": [
    "## 6. Handle Repeating Values (for same information) under the `session` Column\n",
    "- If you observe the values under the `session` column, you can observe that it is a categorical column containing six different categories (as values).\n",
    "    - Notice that the categories `MORNING` and `MOR` are same\n",
    "    - Similarly, `AFTERNOON` and `AFT` are same\n",
    "    - Similarly, `EVENING` and `EVE` are same\n",
    "- This happens when you have collected data from different sources, where same information is written in different ways\n",
    "- So the `session` column has six different categories (as values) but should have only three"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "39378b47",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "df = pd.read_csv('datasets/group-marks.csv' )\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a11aa55",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.session"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6266ae48",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let use check out the counts of unique values inside the session Column\n",
    "df.session.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4bebb2c6",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.session.unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "68e20639",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "e160c362",
   "metadata": {},
   "source": [
    "###  Handle  the Repeating Values under the session Column using `map()`\n",
    "- To keep the data clean we will map all these values to only three categories to `MOR` , `AFT` and `EVE` using the map() function.\n",
    "```\n",
    "df.map(mapping, na_action=None)\n",
    "```\n",
    "- The `map()` method is used for substituting each value in a Series with another value, that may be derived from a `dict`. The `map()` method returns a series after performing the mapping\n",
    "- You can give `ignore` as second argument which will propagate NaN values, without passing them to the mapping correspondence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a258c9b8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# To do this, let us create a new mapping (dictionary) \n",
    "dict1 = {\n",
    "    'MORNING' : 'MOR',\n",
    "    'MOR' : 'MOR',\n",
    "    'AFTERNOON' : 'AFT',\n",
    "    'AFT': 'AFT',\n",
    "    'EVENING' : 'EVE',\n",
    "    'EVE': 'EVE'\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4d85f45f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# It returns a series with the same index as caller, the original series remains unchanged. \n",
    "# So we have assigned the resulting series to `df.session` series\n",
    "df.session.map(dict1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bd8c5c77",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.session = df.session.map(dict1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c20810ca",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Count of new categories in the column session\n",
    "# Observe we have managed to properly manage the values inside the session column\n",
    "df.session.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "614f17db",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let us verify the result\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2fe79ea0",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2b1ef03b",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "92af0ed0",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ed4a973d",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f7cc250",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2b07d53e",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "f0d90de5",
   "metadata": {},
   "source": [
    "## 7. Create a new Column by Modifying an Existing Column\n",
    "- We have a column scholarship in the dataset, which is in Pak Rupees\n",
    "- Suppose you want to have a new column which should represent the scholarship in US Dollars\n",
    "- For that we need to add a new column by dividing each value of scholarship with 150"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8c3133b5",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "df = pd.read_csv('datasets/group-marks.csv' )\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a164db22",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.scholarship.apply(lambda x: x/170)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b0df8d4b",
   "metadata": {},
   "outputs": [],
   "source": [
    "df['Scholarship_in_$'] = df.scholarship.apply(lambda x : x/150)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e579102d",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6c9328f6",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "85c26466",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6dc33e00",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3864967f",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "33eb8118",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "88eea76d",
   "metadata": {},
   "source": [
    "## 8. Delete Rows Having NaN values using `df.dropna()` method\n",
    "**```\n",
    "df.dropna(axis, how, subset, inplace)\n",
    "```**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ef83e54e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "df = pd.read_csv('datasets/group-marks.csv')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b226513f",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ddac322b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# You can use dropna() method to drop all the rows, it it has any na value\n",
    "df1 = df.dropna()\n",
    "df1.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "23e91e38",
   "metadata": {},
   "outputs": [],
   "source": [
    "df1.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1b5d7f58",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3b1c5d46",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Default Arguments to dropna()\n",
    "df2 = df.dropna(axis=0, how='any')\n",
    "df2.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5776a24c",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ae929bfd",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "33388c27",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "37de9061",
   "metadata": {},
   "outputs": [],
   "source": [
    "# If we set how='all` it means drop a row only if all of its values are NA\n",
    "df2 = df.dropna(axis=0, how='all')\n",
    "df2.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cb454e9c",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dd313b95",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d3e33944",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3e9cc0b3",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4a45c8b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use of subset argument and pass it a list of columns based on whose values you want to drop a row\n",
    "df2 = df.dropna(axis=0, how='any', subset=['math'])\n",
    "df2.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "49dc8ff7",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b40e73a8",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "98ea1093",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c409a382",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use of subset argument\n",
    "df2 = df.dropna(axis=0, how='any', subset=['session'])\n",
    "df2.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "59a3d051",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6d6dbf74",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "81cf8668",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24afcf1c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Having `how=all` and `subset=listofcolumnnames`, then it will \n",
    "# drop a row only if both the columns have a NA value in that row\n",
    "df2 = df.dropna(axis=0, how='any', subset=['math', 'session'])\n",
    "df2.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "973223ef",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2c043826",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ebedf621",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "718b79a9",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f13d514a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# If we set the axis=1 and how=all, it means drop a column if all the  values under it is na\n",
    "df2 = df.dropna(axis=1, how='all')\n",
    "df2.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ec9262c2",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "377e62b9",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8941aec9",
   "metadata": {},
   "outputs": [],
   "source": [
    "df2 = df.dropna(axis=1, how='any')\n",
    "df2.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b104da0e",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c779dfff",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b319edee",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ee863cea",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "94a5c257",
   "metadata": {},
   "outputs": [],
   "source": [
    "# If we set the axis=1 and how=any, it means drop a column if any value under it is na\n",
    "df2 = df.dropna(axis=1, how='any')\n",
    "df2.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cf1c032c",
   "metadata": {},
   "outputs": [],
   "source": [
    "df2.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5bc00056",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ae31c28f",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "864c9378",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f4e38815",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24231c5e",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "d527fdbe",
   "metadata": {},
   "source": [
    "## 9. Convert Categorical Variables into Numerical\n",
    "- Most of the machine learning algorithms do not take categorical variables so we need to convert them into numerical ones. \n",
    "- We can do this using Pandas function `pd.get_dummies()`, which will create a binary column for each of the categories. \n",
    "```\n",
    "pd.get_dummies(data, drop_first=False)\n",
    "```\n",
    "- Where, the only required argument is `data` which can be a dataframe or a series\n",
    "- The parameter drop_first : bool, default False Whether to get k-1 dummies out of k categorical levels by removing the first level.\n",
    "\n",
    "**Note:** Making a dummy variable will take all the `K` distinct values in one coumn and make `K` columns out of them"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c52c50a3",
   "metadata": {},
   "source": [
    "### a. Convert all categorical variables into dummy/indicator variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "36e83961",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "df = pd.read_csv('datasets/group-marks.csv')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ddb9d493",
   "metadata": {},
   "outputs": [],
   "source": [
    "# currently we have 10 columns in the data\n",
    "df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "70d3c674",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c081ee33",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert all categorical variables into dummy/indicator variables\n",
    "df = pd.get_dummies(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "23c9ae78",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let us view the datafreame, keep a note on the number of columns\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0ed4ca5c",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# The Number of columns has gone to 1605 now\n",
    "df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9d766891",
   "metadata": {},
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eb94e433",
   "metadata": {},
   "source": [
    "- So we have 112 columns\n",
    "- It adds a lot of dimensionality to your data, i.e., increase the number of columns\n",
    "- It also become difficult to deal with that much number of columns\n",
    "- This is a trade-off, which is handled by technique called dimensionality reduction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a956c709",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "837e3f52",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3c8064be",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "bf725cc0",
   "metadata": {},
   "source": [
    "### b. Perform One-Hot Encoding for Categorical Column `gender` Only\n",
    "- In our dataframe, the gender column is a categorical column having two values 'male' and 'female'\n",
    "- It will create a dummy binary columns.  \n",
    "- This is also known as `One Hot Encoding`. You will learn more encoding techniques in the data pre-processing module.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "c2f41e74",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rollno</th>\n",
       "      <th>name</th>\n",
       "      <th>gender</th>\n",
       "      <th>group</th>\n",
       "      <th>session</th>\n",
       "      <th>age</th>\n",
       "      <th>scholarship</th>\n",
       "      <th>math</th>\n",
       "      <th>english</th>\n",
       "      <th>urdu</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>MS01</td>\n",
       "      <td>SAADIA</td>\n",
       "      <td>female</td>\n",
       "      <td>group B</td>\n",
       "      <td>MORNING</td>\n",
       "      <td>28</td>\n",
       "      <td>2562</td>\n",
       "      <td>No Idea</td>\n",
       "      <td>72.0</td>\n",
       "      <td>74</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>MS02</td>\n",
       "      <td>JUMAIMA</td>\n",
       "      <td>female</td>\n",
       "      <td>group C</td>\n",
       "      <td>AFTERNOON</td>\n",
       "      <td>33</td>\n",
       "      <td>2800</td>\n",
       "      <td>69</td>\n",
       "      <td>90.0</td>\n",
       "      <td>88</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>MS03</td>\n",
       "      <td>ARIFA</td>\n",
       "      <td>female</td>\n",
       "      <td>NaN</td>\n",
       "      <td>EVENING</td>\n",
       "      <td>34</td>\n",
       "      <td>3500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>95.0</td>\n",
       "      <td>93</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>MS04</td>\n",
       "      <td>SAADIA</td>\n",
       "      <td>female</td>\n",
       "      <td>group A</td>\n",
       "      <td>MOR</td>\n",
       "      <td>44</td>\n",
       "      <td>2000</td>\n",
       "      <td>47</td>\n",
       "      <td>57.0</td>\n",
       "      <td>44</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>MS05</td>\n",
       "      <td>DANISH</td>\n",
       "      <td>male</td>\n",
       "      <td>group C</td>\n",
       "      <td>AFTERNOON</td>\n",
       "      <td>54</td>\n",
       "      <td>2100</td>\n",
       "      <td>76</td>\n",
       "      <td>78.0</td>\n",
       "      <td>55</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  rollno     name  gender    group    session  age  scholarship     math  \\\n",
       "0   MS01   SAADIA  female  group B    MORNING   28         2562  No Idea   \n",
       "1   MS02  JUMAIMA  female  group C  AFTERNOON   33         2800       69   \n",
       "2   MS03    ARIFA  female      NaN    EVENING   34         3500      NaN   \n",
       "3   MS04   SAADIA  female  group A        MOR   44         2000       47   \n",
       "4   MS05   DANISH    male  group C  AFTERNOON   54         2100       76   \n",
       "\n",
       "   english  urdu  \n",
       "0     72.0    74  \n",
       "1     90.0    88  \n",
       "2     95.0    93  \n",
       "3     57.0    44  \n",
       "4     78.0    55  "
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "df1 = pd.read_csv('datasets/group-marks.csv')\n",
    "df1.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cb554eba",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "058940b3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert only gender variable into dummy/indicator variables\n",
    "df2 = pd.get_dummies(df1[['gender']])\n",
    "df2.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a111866",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9d07731b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Since we donot need two separate columns, so simply use the `drop_first` argument of get_dummies to handle this\n",
    "df2 = pd.get_dummies(df1[['gender']], drop_first=True)\n",
    "df2.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "48e7a40e",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5633aaec",
   "metadata": {},
   "outputs": [],
   "source": [
    "# We will talk about join in the next session in detail.\n",
    "df3 = df1.join(df2['gender_male'])\n",
    "df3.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "17c7f244",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "218df74c",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bee19d23",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8f25950c",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "684dbde3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rollno</th>\n",
       "      <th>name</th>\n",
       "      <th>gender</th>\n",
       "      <th>group</th>\n",
       "      <th>session</th>\n",
       "      <th>age</th>\n",
       "      <th>scholarship</th>\n",
       "      <th>math</th>\n",
       "      <th>english</th>\n",
       "      <th>urdu</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>MS01</td>\n",
       "      <td>SAADIA</td>\n",
       "      <td>female</td>\n",
       "      <td>group B</td>\n",
       "      <td>MORNING</td>\n",
       "      <td>28</td>\n",
       "      <td>2562</td>\n",
       "      <td>No Idea</td>\n",
       "      <td>72.0</td>\n",
       "      <td>74</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>MS02</td>\n",
       "      <td>JUMAIMA</td>\n",
       "      <td>female</td>\n",
       "      <td>group C</td>\n",
       "      <td>AFTERNOON</td>\n",
       "      <td>33</td>\n",
       "      <td>2800</td>\n",
       "      <td>69</td>\n",
       "      <td>90.0</td>\n",
       "      <td>88</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>MS03</td>\n",
       "      <td>ARIFA</td>\n",
       "      <td>female</td>\n",
       "      <td>NaN</td>\n",
       "      <td>EVENING</td>\n",
       "      <td>34</td>\n",
       "      <td>3500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>95.0</td>\n",
       "      <td>93</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>MS04</td>\n",
       "      <td>SAADIA</td>\n",
       "      <td>female</td>\n",
       "      <td>group A</td>\n",
       "      <td>MOR</td>\n",
       "      <td>44</td>\n",
       "      <td>2000</td>\n",
       "      <td>47</td>\n",
       "      <td>57.0</td>\n",
       "      <td>44</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>MS05</td>\n",
       "      <td>DANISH</td>\n",
       "      <td>male</td>\n",
       "      <td>group C</td>\n",
       "      <td>AFTERNOON</td>\n",
       "      <td>54</td>\n",
       "      <td>2100</td>\n",
       "      <td>76</td>\n",
       "      <td>78.0</td>\n",
       "      <td>55</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  rollno     name  gender    group    session  age  scholarship     math  \\\n",
       "0   MS01   SAADIA  female  group B    MORNING   28         2562  No Idea   \n",
       "1   MS02  JUMAIMA  female  group C  AFTERNOON   33         2800       69   \n",
       "2   MS03    ARIFA  female      NaN    EVENING   34         3500      NaN   \n",
       "3   MS04   SAADIA  female  group A        MOR   44         2000       47   \n",
       "4   MS05   DANISH    male  group C  AFTERNOON   54         2100       76   \n",
       "\n",
       "   english  urdu  \n",
       "0     72.0    74  \n",
       "1     90.0    88  \n",
       "2     95.0    93  \n",
       "3     57.0    44  \n",
       "4     78.0    55  "
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "df = pd.read_csv('datasets/group-marks.csv')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "dede203f",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.session.value_counts()\n",
    "dict1 = {\n",
    "    'MORNING' : 'MOR',\n",
    "    'AFTERNOON' : 'AFT',\n",
    "}\n",
    "#df.session = df.session.map(dict1)\n",
    "df.session = df.session.replace(dict1)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "a5aa5677",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rollno</th>\n",
       "      <th>name</th>\n",
       "      <th>gender</th>\n",
       "      <th>group</th>\n",
       "      <th>session</th>\n",
       "      <th>age</th>\n",
       "      <th>scholarship</th>\n",
       "      <th>math</th>\n",
       "      <th>english</th>\n",
       "      <th>urdu</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>MS01</td>\n",
       "      <td>SAADIA</td>\n",
       "      <td>female</td>\n",
       "      <td>group B</td>\n",
       "      <td>MOR</td>\n",
       "      <td>28</td>\n",
       "      <td>2562</td>\n",
       "      <td>No Idea</td>\n",
       "      <td>72.0</td>\n",
       "      <td>74</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>MS02</td>\n",
       "      <td>JUMAIMA</td>\n",
       "      <td>female</td>\n",
       "      <td>group C</td>\n",
       "      <td>AFT</td>\n",
       "      <td>33</td>\n",
       "      <td>2800</td>\n",
       "      <td>69</td>\n",
       "      <td>90.0</td>\n",
       "      <td>88</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>MS03</td>\n",
       "      <td>ARIFA</td>\n",
       "      <td>female</td>\n",
       "      <td>NaN</td>\n",
       "      <td>EVENING</td>\n",
       "      <td>34</td>\n",
       "      <td>3500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>95.0</td>\n",
       "      <td>93</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>MS04</td>\n",
       "      <td>SAADIA</td>\n",
       "      <td>female</td>\n",
       "      <td>group A</td>\n",
       "      <td>MOR</td>\n",
       "      <td>44</td>\n",
       "      <td>2000</td>\n",
       "      <td>47</td>\n",
       "      <td>57.0</td>\n",
       "      <td>44</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>MS05</td>\n",
       "      <td>DANISH</td>\n",
       "      <td>male</td>\n",
       "      <td>group C</td>\n",
       "      <td>AFT</td>\n",
       "      <td>54</td>\n",
       "      <td>2100</td>\n",
       "      <td>76</td>\n",
       "      <td>78.0</td>\n",
       "      <td>55</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>MS06</td>\n",
       "      <td>SAFIA</td>\n",
       "      <td>female</td>\n",
       "      <td>group B</td>\n",
       "      <td>AFT</td>\n",
       "      <td>23</td>\n",
       "      <td>3800</td>\n",
       "      <td>NaN</td>\n",
       "      <td>83.0</td>\n",
       "      <td>78</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>MS07</td>\n",
       "      <td>SARA</td>\n",
       "      <td>female</td>\n",
       "      <td>group B</td>\n",
       "      <td>EVENING</td>\n",
       "      <td>47</td>\n",
       "      <td>3000</td>\n",
       "      <td>88</td>\n",
       "      <td>95.0</td>\n",
       "      <td>92</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>MS08</td>\n",
       "      <td>ABDULLAH</td>\n",
       "      <td>male</td>\n",
       "      <td>group B</td>\n",
       "      <td>EVE</td>\n",
       "      <td>33</td>\n",
       "      <td>2000</td>\n",
       "      <td>40</td>\n",
       "      <td>43.0</td>\n",
       "      <td>39</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>MS09</td>\n",
       "      <td>KHAN</td>\n",
       "      <td>male</td>\n",
       "      <td>group D</td>\n",
       "      <td>MOR</td>\n",
       "      <td>27</td>\n",
       "      <td>2500</td>\n",
       "      <td>64</td>\n",
       "      <td>NaN</td>\n",
       "      <td>67</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>MS10</td>\n",
       "      <td>HASEENA</td>\n",
       "      <td>female</td>\n",
       "      <td>group B</td>\n",
       "      <td>AFT</td>\n",
       "      <td>33</td>\n",
       "      <td>2800</td>\n",
       "      <td>38</td>\n",
       "      <td>60.0</td>\n",
       "      <td>50</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>MS11</td>\n",
       "      <td>MUSTJAB</td>\n",
       "      <td>male</td>\n",
       "      <td>group C</td>\n",
       "      <td>MOR</td>\n",
       "      <td>46</td>\n",
       "      <td>3000</td>\n",
       "      <td>58</td>\n",
       "      <td>54.0</td>\n",
       "      <td>52</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>MS12</td>\n",
       "      <td>ABRAR</td>\n",
       "      <td>male</td>\n",
       "      <td>group D</td>\n",
       "      <td>MOR</td>\n",
       "      <td>53</td>\n",
       "      <td>3312</td>\n",
       "      <td>40</td>\n",
       "      <td>52.0</td>\n",
       "      <td>43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>MS13</td>\n",
       "      <td>MAHOOR</td>\n",
       "      <td>female</td>\n",
       "      <td>NaN</td>\n",
       "      <td>MOR</td>\n",
       "      <td>25</td>\n",
       "      <td>2345</td>\n",
       "      <td>65</td>\n",
       "      <td>81.0</td>\n",
       "      <td>73</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>MS14</td>\n",
       "      <td>USAMA</td>\n",
       "      <td>male</td>\n",
       "      <td>group A</td>\n",
       "      <td>AFT</td>\n",
       "      <td>26</td>\n",
       "      <td>2654</td>\n",
       "      <td>78</td>\n",
       "      <td>72.0</td>\n",
       "      <td>70</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>MS15</td>\n",
       "      <td>NAVAIRA</td>\n",
       "      <td>female</td>\n",
       "      <td>group A</td>\n",
       "      <td>AFT</td>\n",
       "      <td>25</td>\n",
       "      <td>2137</td>\n",
       "      <td>50</td>\n",
       "      <td>53.0</td>\n",
       "      <td>58</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>MS16</td>\n",
       "      <td>SAWAIRA</td>\n",
       "      <td>female</td>\n",
       "      <td>group C</td>\n",
       "      <td>EVENING</td>\n",
       "      <td>29</td>\n",
       "      <td>2567</td>\n",
       "      <td>69</td>\n",
       "      <td>75.0</td>\n",
       "      <td>78</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>MS17</td>\n",
       "      <td>NOFIL</td>\n",
       "      <td>male</td>\n",
       "      <td>group C</td>\n",
       "      <td>MOR</td>\n",
       "      <td>22</td>\n",
       "      <td>3500</td>\n",
       "      <td>88</td>\n",
       "      <td>89.0</td>\n",
       "      <td>86</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>MS18</td>\n",
       "      <td>SHUMAILA</td>\n",
       "      <td>female</td>\n",
       "      <td>group B</td>\n",
       "      <td>AFT</td>\n",
       "      <td>31</td>\n",
       "      <td>2500</td>\n",
       "      <td>18</td>\n",
       "      <td>NaN</td>\n",
       "      <td>28</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>MS19</td>\n",
       "      <td>ABUBAKAR</td>\n",
       "      <td>male</td>\n",
       "      <td>group C</td>\n",
       "      <td>AFT</td>\n",
       "      <td>33</td>\n",
       "      <td>3000</td>\n",
       "      <td>46</td>\n",
       "      <td>42.0</td>\n",
       "      <td>46</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>MS20</td>\n",
       "      <td>NADRA</td>\n",
       "      <td>female</td>\n",
       "      <td>group C</td>\n",
       "      <td>MOR</td>\n",
       "      <td>36</td>\n",
       "      <td>2200</td>\n",
       "      <td>54</td>\n",
       "      <td>58.0</td>\n",
       "      <td>45</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>MS21</td>\n",
       "      <td>HASEEB</td>\n",
       "      <td>male</td>\n",
       "      <td>group D</td>\n",
       "      <td>AFT</td>\n",
       "      <td>34</td>\n",
       "      <td>3500</td>\n",
       "      <td>66</td>\n",
       "      <td>69.0</td>\n",
       "      <td>63</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>MS22</td>\n",
       "      <td>HOORIA</td>\n",
       "      <td>female</td>\n",
       "      <td>group B</td>\n",
       "      <td>EVENING</td>\n",
       "      <td>32</td>\n",
       "      <td>2000</td>\n",
       "      <td>65</td>\n",
       "      <td>75.0</td>\n",
       "      <td>70</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>MS23</td>\n",
       "      <td>RAUF</td>\n",
       "      <td>male</td>\n",
       "      <td>group D</td>\n",
       "      <td>MOR</td>\n",
       "      <td>31</td>\n",
       "      <td>2500</td>\n",
       "      <td>44</td>\n",
       "      <td>54.0</td>\n",
       "      <td>53</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>MS24</td>\n",
       "      <td>LAIBA</td>\n",
       "      <td>female</td>\n",
       "      <td>group C</td>\n",
       "      <td>AFT</td>\n",
       "      <td>37</td>\n",
       "      <td>3000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>73.0</td>\n",
       "      <td>73</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>MS25</td>\n",
       "      <td>FATIMA</td>\n",
       "      <td>male</td>\n",
       "      <td>group D</td>\n",
       "      <td>AFT</td>\n",
       "      <td>33</td>\n",
       "      <td>2500</td>\n",
       "      <td>74</td>\n",
       "      <td>71.0</td>\n",
       "      <td>80</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>MS26</td>\n",
       "      <td>IBRAR</td>\n",
       "      <td>male</td>\n",
       "      <td>group A</td>\n",
       "      <td>EVENING</td>\n",
       "      <td>39</td>\n",
       "      <td>3500</td>\n",
       "      <td>73</td>\n",
       "      <td>74.0</td>\n",
       "      <td>72</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>MS27</td>\n",
       "      <td>USAMA</td>\n",
       "      <td>male</td>\n",
       "      <td>group B</td>\n",
       "      <td>EVE</td>\n",
       "      <td>32</td>\n",
       "      <td>2500</td>\n",
       "      <td>69</td>\n",
       "      <td>54.0</td>\n",
       "      <td>55</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>MS28</td>\n",
       "      <td>NOOR</td>\n",
       "      <td>female</td>\n",
       "      <td>group C</td>\n",
       "      <td>MOR</td>\n",
       "      <td>47</td>\n",
       "      <td>3000</td>\n",
       "      <td>67</td>\n",
       "      <td>69.0</td>\n",
       "      <td>75</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>MS29</td>\n",
       "      <td>UMAR</td>\n",
       "      <td>male</td>\n",
       "      <td>group C</td>\n",
       "      <td>AFT</td>\n",
       "      <td>44</td>\n",
       "      <td>4000</td>\n",
       "      <td>70</td>\n",
       "      <td>70.0</td>\n",
       "      <td>65</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>MS30</td>\n",
       "      <td>ROSHAN</td>\n",
       "      <td>female</td>\n",
       "      <td>group D</td>\n",
       "      <td>MOR</td>\n",
       "      <td>42</td>\n",
       "      <td>3500</td>\n",
       "      <td>62</td>\n",
       "      <td>70.0</td>\n",
       "      <td>75</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30</th>\n",
       "      <td>MS31</td>\n",
       "      <td>UNAIZA</td>\n",
       "      <td>female</td>\n",
       "      <td>group D</td>\n",
       "      <td>EVENING</td>\n",
       "      <td>41</td>\n",
       "      <td>2500</td>\n",
       "      <td>69</td>\n",
       "      <td>74.0</td>\n",
       "      <td>74</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>31</th>\n",
       "      <td>MS32</td>\n",
       "      <td>FATIMA</td>\n",
       "      <td>female</td>\n",
       "      <td>group B</td>\n",
       "      <td>EVE</td>\n",
       "      <td>19</td>\n",
       "      <td>3000</td>\n",
       "      <td>63</td>\n",
       "      <td>NaN</td>\n",
       "      <td>61</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32</th>\n",
       "      <td>MS33</td>\n",
       "      <td>SHAISTA</td>\n",
       "      <td>female</td>\n",
       "      <td>NaN</td>\n",
       "      <td>MOR</td>\n",
       "      <td>29</td>\n",
       "      <td>3500</td>\n",
       "      <td>56</td>\n",
       "      <td>72.0</td>\n",
       "      <td>65</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>33</th>\n",
       "      <td>MS34</td>\n",
       "      <td>ALI</td>\n",
       "      <td>male</td>\n",
       "      <td>group D</td>\n",
       "      <td>EVENING</td>\n",
       "      <td>34</td>\n",
       "      <td>3000</td>\n",
       "      <td>40</td>\n",
       "      <td>42.0</td>\n",
       "      <td>38</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34</th>\n",
       "      <td>MS35</td>\n",
       "      <td>ABDULLAH</td>\n",
       "      <td>male</td>\n",
       "      <td>group E</td>\n",
       "      <td>MOR</td>\n",
       "      <td>45</td>\n",
       "      <td>2500</td>\n",
       "      <td>97</td>\n",
       "      <td>87.0</td>\n",
       "      <td>82</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35</th>\n",
       "      <td>MS36</td>\n",
       "      <td>OSAMA</td>\n",
       "      <td>male</td>\n",
       "      <td>group E</td>\n",
       "      <td>AFT</td>\n",
       "      <td>31</td>\n",
       "      <td>3500</td>\n",
       "      <td>81</td>\n",
       "      <td>81.0</td>\n",
       "      <td>79</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36</th>\n",
       "      <td>MS37</td>\n",
       "      <td>AMBREEN</td>\n",
       "      <td>female</td>\n",
       "      <td>group D</td>\n",
       "      <td>EVENING</td>\n",
       "      <td>28</td>\n",
       "      <td>3000</td>\n",
       "      <td>74</td>\n",
       "      <td>81.0</td>\n",
       "      <td>62</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>37</th>\n",
       "      <td>MS38</td>\n",
       "      <td>NASEEM</td>\n",
       "      <td>female</td>\n",
       "      <td>group D</td>\n",
       "      <td>MOR</td>\n",
       "      <td>26</td>\n",
       "      <td>2500</td>\n",
       "      <td>50</td>\n",
       "      <td>64.0</td>\n",
       "      <td>59</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38</th>\n",
       "      <td>MS39</td>\n",
       "      <td>SAFIA</td>\n",
       "      <td>female</td>\n",
       "      <td>group D</td>\n",
       "      <td>AFT</td>\n",
       "      <td>36</td>\n",
       "      <td>3500</td>\n",
       "      <td>75</td>\n",
       "      <td>90.0</td>\n",
       "      <td>88</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39</th>\n",
       "      <td>MS40</td>\n",
       "      <td>NABEEL</td>\n",
       "      <td>male</td>\n",
       "      <td>group B</td>\n",
       "      <td>AFT</td>\n",
       "      <td>33</td>\n",
       "      <td>3000</td>\n",
       "      <td>57</td>\n",
       "      <td>56.0</td>\n",
       "      <td>57</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>40</th>\n",
       "      <td>MS41</td>\n",
       "      <td>SIKANDER</td>\n",
       "      <td>male</td>\n",
       "      <td>group C</td>\n",
       "      <td>EVENING</td>\n",
       "      <td>35</td>\n",
       "      <td>2500</td>\n",
       "      <td>55</td>\n",
       "      <td>61.0</td>\n",
       "      <td>54</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>41</th>\n",
       "      <td>MS42</td>\n",
       "      <td>ZAIN</td>\n",
       "      <td>female</td>\n",
       "      <td>group C</td>\n",
       "      <td>EVE</td>\n",
       "      <td>27</td>\n",
       "      <td>3500</td>\n",
       "      <td>58</td>\n",
       "      <td>73.0</td>\n",
       "      <td>68</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>42</th>\n",
       "      <td>MS43</td>\n",
       "      <td>AYESHA</td>\n",
       "      <td>female</td>\n",
       "      <td>group B</td>\n",
       "      <td>MOR</td>\n",
       "      <td>38</td>\n",
       "      <td>3000</td>\n",
       "      <td>53</td>\n",
       "      <td>58.0</td>\n",
       "      <td>65</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43</th>\n",
       "      <td>MS44</td>\n",
       "      <td>HASSAN</td>\n",
       "      <td>male</td>\n",
       "      <td>group B</td>\n",
       "      <td>AFT</td>\n",
       "      <td>31</td>\n",
       "      <td>2500</td>\n",
       "      <td>59</td>\n",
       "      <td>65.0</td>\n",
       "      <td>66</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>44</th>\n",
       "      <td>MS45</td>\n",
       "      <td>ZAINAB</td>\n",
       "      <td>female</td>\n",
       "      <td>group E</td>\n",
       "      <td>MOR</td>\n",
       "      <td>28</td>\n",
       "      <td>3500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>56.0</td>\n",
       "      <td>54</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>45</th>\n",
       "      <td>MS46</td>\n",
       "      <td>TALHA</td>\n",
       "      <td>male</td>\n",
       "      <td>group B</td>\n",
       "      <td>EVENING</td>\n",
       "      <td>41</td>\n",
       "      <td>3000</td>\n",
       "      <td>65</td>\n",
       "      <td>54.0</td>\n",
       "      <td>57</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>46</th>\n",
       "      <td>MS47</td>\n",
       "      <td>SABA</td>\n",
       "      <td>female</td>\n",
       "      <td>group A</td>\n",
       "      <td>EVE</td>\n",
       "      <td>36</td>\n",
       "      <td>2500</td>\n",
       "      <td>55</td>\n",
       "      <td>65.0</td>\n",
       "      <td>62</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>47</th>\n",
       "      <td>MS48</td>\n",
       "      <td>RAFIA</td>\n",
       "      <td>female</td>\n",
       "      <td>group C</td>\n",
       "      <td>MOR</td>\n",
       "      <td>30</td>\n",
       "      <td>3500</td>\n",
       "      <td>66</td>\n",
       "      <td>71.0</td>\n",
       "      <td>76</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48</th>\n",
       "      <td>MS49</td>\n",
       "      <td>FATIMA</td>\n",
       "      <td>female</td>\n",
       "      <td>group D</td>\n",
       "      <td>MOR</td>\n",
       "      <td>40</td>\n",
       "      <td>2500</td>\n",
       "      <td>57</td>\n",
       "      <td>74.0</td>\n",
       "      <td>76</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49</th>\n",
       "      <td>MS50</td>\n",
       "      <td>KAKAMANNA</td>\n",
       "      <td>male</td>\n",
       "      <td>group C</td>\n",
       "      <td>AFT</td>\n",
       "      <td>37</td>\n",
       "      <td>3000</td>\n",
       "      <td>66</td>\n",
       "      <td>78.0</td>\n",
       "      <td>81</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   rollno       name  gender    group  session  age  scholarship     math  \\\n",
       "0    MS01     SAADIA  female  group B      MOR   28         2562  No Idea   \n",
       "1    MS02    JUMAIMA  female  group C      AFT   33         2800       69   \n",
       "2    MS03      ARIFA  female      NaN  EVENING   34         3500      NaN   \n",
       "3    MS04     SAADIA  female  group A      MOR   44         2000       47   \n",
       "4    MS05     DANISH    male  group C      AFT   54         2100       76   \n",
       "5    MS06      SAFIA  female  group B      AFT   23         3800      NaN   \n",
       "6    MS07       SARA  female  group B  EVENING   47         3000       88   \n",
       "7    MS08   ABDULLAH    male  group B      EVE   33         2000       40   \n",
       "8    MS09       KHAN    male  group D      MOR   27         2500       64   \n",
       "9    MS10    HASEENA  female  group B      AFT   33         2800       38   \n",
       "10   MS11    MUSTJAB    male  group C      MOR   46         3000       58   \n",
       "11   MS12      ABRAR    male  group D      MOR   53         3312       40   \n",
       "12   MS13     MAHOOR  female      NaN      MOR   25         2345       65   \n",
       "13   MS14      USAMA    male  group A      AFT   26         2654       78   \n",
       "14   MS15    NAVAIRA  female  group A      AFT   25         2137       50   \n",
       "15   MS16    SAWAIRA  female  group C  EVENING   29         2567       69   \n",
       "16   MS17      NOFIL    male  group C      MOR   22         3500       88   \n",
       "17   MS18   SHUMAILA  female  group B      AFT   31         2500       18   \n",
       "18   MS19   ABUBAKAR    male  group C      AFT   33         3000       46   \n",
       "19   MS20      NADRA  female  group C      MOR   36         2200       54   \n",
       "20   MS21     HASEEB    male  group D      AFT   34         3500       66   \n",
       "21   MS22     HOORIA  female  group B  EVENING   32         2000       65   \n",
       "22   MS23       RAUF    male  group D      MOR   31         2500       44   \n",
       "23   MS24      LAIBA  female  group C      AFT   37         3000      NaN   \n",
       "24   MS25     FATIMA    male  group D      AFT   33         2500       74   \n",
       "25   MS26      IBRAR    male  group A  EVENING   39         3500       73   \n",
       "26   MS27      USAMA    male  group B      EVE   32         2500       69   \n",
       "27   MS28       NOOR  female  group C      MOR   47         3000       67   \n",
       "28   MS29       UMAR    male  group C      AFT   44         4000       70   \n",
       "29   MS30     ROSHAN  female  group D      MOR   42         3500       62   \n",
       "30   MS31     UNAIZA  female  group D  EVENING   41         2500       69   \n",
       "31   MS32     FATIMA  female  group B      EVE   19         3000       63   \n",
       "32   MS33    SHAISTA  female      NaN      MOR   29         3500       56   \n",
       "33   MS34        ALI    male  group D  EVENING   34         3000       40   \n",
       "34   MS35   ABDULLAH    male  group E      MOR   45         2500       97   \n",
       "35   MS36      OSAMA    male  group E      AFT   31         3500       81   \n",
       "36   MS37    AMBREEN  female  group D  EVENING   28         3000       74   \n",
       "37   MS38     NASEEM  female  group D      MOR   26         2500       50   \n",
       "38   MS39      SAFIA  female  group D      AFT   36         3500       75   \n",
       "39   MS40     NABEEL    male  group B      AFT   33         3000       57   \n",
       "40   MS41   SIKANDER    male  group C  EVENING   35         2500       55   \n",
       "41   MS42       ZAIN  female  group C      EVE   27         3500       58   \n",
       "42   MS43     AYESHA  female  group B      MOR   38         3000       53   \n",
       "43   MS44     HASSAN    male  group B      AFT   31         2500       59   \n",
       "44   MS45     ZAINAB  female  group E      MOR   28         3500      NaN   \n",
       "45   MS46      TALHA    male  group B  EVENING   41         3000       65   \n",
       "46   MS47       SABA  female  group A      EVE   36         2500       55   \n",
       "47   MS48      RAFIA  female  group C      MOR   30         3500       66   \n",
       "48   MS49     FATIMA  female  group D      MOR   40         2500       57   \n",
       "49   MS50  KAKAMANNA    male  group C      AFT   37         3000       66   \n",
       "\n",
       "    english  urdu  \n",
       "0      72.0    74  \n",
       "1      90.0    88  \n",
       "2      95.0    93  \n",
       "3      57.0    44  \n",
       "4      78.0    55  \n",
       "5      83.0    78  \n",
       "6      95.0    92  \n",
       "7      43.0    39  \n",
       "8       NaN    67  \n",
       "9      60.0    50  \n",
       "10     54.0    52  \n",
       "11     52.0    43  \n",
       "12     81.0    73  \n",
       "13     72.0    70  \n",
       "14     53.0    58  \n",
       "15     75.0    78  \n",
       "16     89.0    86  \n",
       "17      NaN    28  \n",
       "18     42.0    46  \n",
       "19     58.0    45  \n",
       "20     69.0    63  \n",
       "21     75.0    70  \n",
       "22     54.0    53  \n",
       "23     73.0    73  \n",
       "24     71.0    80  \n",
       "25     74.0    72  \n",
       "26     54.0    55  \n",
       "27     69.0    75  \n",
       "28     70.0    65  \n",
       "29     70.0    75  \n",
       "30     74.0    74  \n",
       "31      NaN    61  \n",
       "32     72.0    65  \n",
       "33     42.0    38  \n",
       "34     87.0    82  \n",
       "35     81.0    79  \n",
       "36     81.0    62  \n",
       "37     64.0    59  \n",
       "38     90.0    88  \n",
       "39     56.0    57  \n",
       "40     61.0    54  \n",
       "41     73.0    68  \n",
       "42     58.0    65  \n",
       "43     65.0    66  \n",
       "44     56.0    54  \n",
       "45     54.0    57  \n",
       "46     65.0    62  \n",
       "47     71.0    76  \n",
       "48     74.0    76  \n",
       "49     78.0    81  "
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4223b274",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9341d7a6",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "ad4df941",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>AFT</th>\n",
       "      <th>AFTERNOON</th>\n",
       "      <th>EVE</th>\n",
       "      <th>EVENING</th>\n",
       "      <th>MOR</th>\n",
       "      <th>MORNING</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   AFT  AFTERNOON  EVE  EVENING  MOR  MORNING\n",
       "0    0          0    0        0    0        1\n",
       "1    0          1    0        0    0        0\n",
       "2    0          0    0        1    0        0\n",
       "3    0          0    0        0    1        0\n",
       "4    0          1    0        0    0        0"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df1 = pd.get_dummies(df.session)\n",
    "df1.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "26a9a4e2",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f235f8c5",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "39a3b975",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "a84648f2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rollno</th>\n",
       "      <th>name</th>\n",
       "      <th>gender</th>\n",
       "      <th>group</th>\n",
       "      <th>session</th>\n",
       "      <th>age</th>\n",
       "      <th>scholarship</th>\n",
       "      <th>math</th>\n",
       "      <th>english</th>\n",
       "      <th>urdu</th>\n",
       "      <th>male</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>MS01</td>\n",
       "      <td>SAADIA</td>\n",
       "      <td>female</td>\n",
       "      <td>group B</td>\n",
       "      <td>MORNING</td>\n",
       "      <td>28</td>\n",
       "      <td>2562</td>\n",
       "      <td>No Idea</td>\n",
       "      <td>72.0</td>\n",
       "      <td>74</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>MS02</td>\n",
       "      <td>JUMAIMA</td>\n",
       "      <td>female</td>\n",
       "      <td>group C</td>\n",
       "      <td>AFTERNOON</td>\n",
       "      <td>33</td>\n",
       "      <td>2800</td>\n",
       "      <td>69</td>\n",
       "      <td>90.0</td>\n",
       "      <td>88</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>MS03</td>\n",
       "      <td>ARIFA</td>\n",
       "      <td>female</td>\n",
       "      <td>NaN</td>\n",
       "      <td>EVENING</td>\n",
       "      <td>34</td>\n",
       "      <td>3500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>95.0</td>\n",
       "      <td>93</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>MS04</td>\n",
       "      <td>SAADIA</td>\n",
       "      <td>female</td>\n",
       "      <td>group A</td>\n",
       "      <td>MOR</td>\n",
       "      <td>44</td>\n",
       "      <td>2000</td>\n",
       "      <td>47</td>\n",
       "      <td>57.0</td>\n",
       "      <td>44</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>MS05</td>\n",
       "      <td>DANISH</td>\n",
       "      <td>male</td>\n",
       "      <td>group C</td>\n",
       "      <td>AFTERNOON</td>\n",
       "      <td>54</td>\n",
       "      <td>2100</td>\n",
       "      <td>76</td>\n",
       "      <td>78.0</td>\n",
       "      <td>55</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  rollno     name  gender    group    session  age  scholarship     math  \\\n",
       "0   MS01   SAADIA  female  group B    MORNING   28         2562  No Idea   \n",
       "1   MS02  JUMAIMA  female  group C  AFTERNOON   33         2800       69   \n",
       "2   MS03    ARIFA  female      NaN    EVENING   34         3500      NaN   \n",
       "3   MS04   SAADIA  female  group A        MOR   44         2000       47   \n",
       "4   MS05   DANISH    male  group C  AFTERNOON   54         2100       76   \n",
       "\n",
       "   english  urdu  male  \n",
       "0     72.0    74     0  \n",
       "1     90.0    88     0  \n",
       "2     95.0    93     0  \n",
       "3     57.0    44     0  \n",
       "4     78.0    55     1  "
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df1 = pd.get_dummies(df.gender, drop_first=True)\n",
    "df3 = df.join(df1['male'])\n",
    "df3.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7035f8e1",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1028a381",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "d0fbbba0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rollno</th>\n",
       "      <th>name</th>\n",
       "      <th>gender</th>\n",
       "      <th>group</th>\n",
       "      <th>session</th>\n",
       "      <th>age</th>\n",
       "      <th>scholarship</th>\n",
       "      <th>math</th>\n",
       "      <th>english</th>\n",
       "      <th>urdu</th>\n",
       "      <th>gender_male</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>MS01</td>\n",
       "      <td>SAADIA</td>\n",
       "      <td>female</td>\n",
       "      <td>group B</td>\n",
       "      <td>MORNING</td>\n",
       "      <td>28</td>\n",
       "      <td>2562</td>\n",
       "      <td>No Idea</td>\n",
       "      <td>72.0</td>\n",
       "      <td>74</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>MS02</td>\n",
       "      <td>JUMAIMA</td>\n",
       "      <td>female</td>\n",
       "      <td>group C</td>\n",
       "      <td>AFTERNOON</td>\n",
       "      <td>33</td>\n",
       "      <td>2800</td>\n",
       "      <td>69</td>\n",
       "      <td>90.0</td>\n",
       "      <td>88</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>MS03</td>\n",
       "      <td>ARIFA</td>\n",
       "      <td>female</td>\n",
       "      <td>NaN</td>\n",
       "      <td>EVENING</td>\n",
       "      <td>34</td>\n",
       "      <td>3500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>95.0</td>\n",
       "      <td>93</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>MS04</td>\n",
       "      <td>SAADIA</td>\n",
       "      <td>female</td>\n",
       "      <td>group A</td>\n",
       "      <td>MOR</td>\n",
       "      <td>44</td>\n",
       "      <td>2000</td>\n",
       "      <td>47</td>\n",
       "      <td>57.0</td>\n",
       "      <td>44</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>MS05</td>\n",
       "      <td>DANISH</td>\n",
       "      <td>male</td>\n",
       "      <td>group C</td>\n",
       "      <td>AFTERNOON</td>\n",
       "      <td>54</td>\n",
       "      <td>2100</td>\n",
       "      <td>76</td>\n",
       "      <td>78.0</td>\n",
       "      <td>55</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  rollno     name  gender    group    session  age  scholarship     math  \\\n",
       "0   MS01   SAADIA  female  group B    MORNING   28         2562  No Idea   \n",
       "1   MS02  JUMAIMA  female  group C  AFTERNOON   33         2800       69   \n",
       "2   MS03    ARIFA  female      NaN    EVENING   34         3500      NaN   \n",
       "3   MS04   SAADIA  female  group A        MOR   44         2000       47   \n",
       "4   MS05   DANISH    male  group C  AFTERNOON   54         2100       76   \n",
       "\n",
       "   english  urdu  gender_male  \n",
       "0     72.0    74            0  \n",
       "1     90.0    88            0  \n",
       "2     95.0    93            0  \n",
       "3     57.0    44            0  \n",
       "4     78.0    55            1  "
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df1 = pd.get_dummies(df[['gender']], drop_first=True)\n",
    "df3 = df.join(df1['gender_male'])\n",
    "df3.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "1ef03d88",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rollno</th>\n",
       "      <th>name</th>\n",
       "      <th>gender</th>\n",
       "      <th>group</th>\n",
       "      <th>session</th>\n",
       "      <th>age</th>\n",
       "      <th>scholarship</th>\n",
       "      <th>math</th>\n",
       "      <th>english</th>\n",
       "      <th>urdu</th>\n",
       "      <th>gender_male</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>MS01</td>\n",
       "      <td>SAADIA</td>\n",
       "      <td>female</td>\n",
       "      <td>group B</td>\n",
       "      <td>MORNING</td>\n",
       "      <td>28</td>\n",
       "      <td>2562</td>\n",
       "      <td>No Idea</td>\n",
       "      <td>72.0</td>\n",
       "      <td>74</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>MS02</td>\n",
       "      <td>JUMAIMA</td>\n",
       "      <td>female</td>\n",
       "      <td>group C</td>\n",
       "      <td>AFTERNOON</td>\n",
       "      <td>33</td>\n",
       "      <td>2800</td>\n",
       "      <td>69</td>\n",
       "      <td>90.0</td>\n",
       "      <td>88</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>MS03</td>\n",
       "      <td>ARIFA</td>\n",
       "      <td>female</td>\n",
       "      <td>NaN</td>\n",
       "      <td>EVENING</td>\n",
       "      <td>34</td>\n",
       "      <td>3500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>95.0</td>\n",
       "      <td>93</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>MS04</td>\n",
       "      <td>SAADIA</td>\n",
       "      <td>female</td>\n",
       "      <td>group A</td>\n",
       "      <td>MOR</td>\n",
       "      <td>44</td>\n",
       "      <td>2000</td>\n",
       "      <td>47</td>\n",
       "      <td>57.0</td>\n",
       "      <td>44</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>MS05</td>\n",
       "      <td>DANISH</td>\n",
       "      <td>male</td>\n",
       "      <td>group C</td>\n",
       "      <td>AFTERNOON</td>\n",
       "      <td>54</td>\n",
       "      <td>2100</td>\n",
       "      <td>76</td>\n",
       "      <td>78.0</td>\n",
       "      <td>55</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  rollno     name  gender    group    session  age  scholarship     math  \\\n",
       "0   MS01   SAADIA  female  group B    MORNING   28         2562  No Idea   \n",
       "1   MS02  JUMAIMA  female  group C  AFTERNOON   33         2800       69   \n",
       "2   MS03    ARIFA  female      NaN    EVENING   34         3500      NaN   \n",
       "3   MS04   SAADIA  female  group A        MOR   44         2000       47   \n",
       "4   MS05   DANISH    male  group C  AFTERNOON   54         2100       76   \n",
       "\n",
       "   english  urdu  gender_male  \n",
       "0     72.0    74            0  \n",
       "1     90.0    88            0  \n",
       "2     95.0    93            0  \n",
       "3     57.0    44            0  \n",
       "4     78.0    55            1  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df3 = df.join(df1['gender_male'])\n",
    "df3.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "25cdfe07",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}