{ "cells": [ { "cell_type": "markdown", "id": "4a87b5ef", "metadata": {}, "source": [ "--- \n", " \n", "\n", "

Department of Data Science

\n", "

Course: Tools and Techniques for Data Science

\n", "\n", "---\n", "

Instructor: Muhammad Arif Butt, Ph.D.

" ] }, { "cell_type": "markdown", "id": "ab0dc25c", "metadata": {}, "source": [ "

Lecture 3.16 (Pandas-08)

" ] }, { "cell_type": "markdown", "id": "2aa8cebf", "metadata": {}, "source": [ "\"Open" ] }, { "cell_type": "markdown", "id": "19f82705", "metadata": {}, "source": [ "\n", "\n", "## _Handling Missing Data.ipynb_" ] }, { "cell_type": "code", "execution_count": null, "id": "7fa27296", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "12db95e1", "metadata": {}, "source": [ "## Learning agenda of this notebook\n", "\n", "1. Have an insight about the Dataset\n", "2. Identify the Columns having Null/Missing values using `df.isna()` method\n", "3. Handle/Impute the Null/Missing Values under the `math` Column using `df.loc[mask,col]=value`\n", "4. Handle/Impute the Null/Missing Values under the `group` Column using `df.loc[mask,col]=value`\n", "5. Handle Missing values under a Numeric/Categorical Column using `fillna()`\n", "6. Handle Repeating Values (for same information) under the `session` Column\n", "7. Create a new Column by Modifying an Existing Column\n", "8. Delete Rows Having NaN values using `df.dropna()` method\n", "9. Convert Categorical Variables into Numerical" ] }, { "cell_type": "code", "execution_count": null, "id": "0024e3de", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "c741f51b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "d804305e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "9f14bdc3", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "1d2bba7a", "metadata": {}, "source": [ "## 1. Have an Insight about the Dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "3c18ecd6", "metadata": {}, "outputs": [], "source": [ "! cat datasets/group-marks.csv" ] }, { "cell_type": "code", "execution_count": null, "id": "7144b3e0", "metadata": {}, "outputs": [], "source": [ "# import the pandas library\n", "import pandas as pd\n", "df = pd.read_csv('datasets/group-marks.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "cf6ed7b1", "metadata": {}, "outputs": [], "source": [ "df.shape" ] }, { "cell_type": "markdown", "id": "8653443c", "metadata": {}, "source": [ "- Whenever the **`pd.read.csv()`** method detects a missing value (nothing between two commas in a csv file or an empty cell in Excel) it flags it with NaN. There can be many reasons for these NaN values, one can be that the data is gathered via google form from people and this field might be optional and skipped.\n", "- There can also be a scenario that a user has entered some text under a numeric field about which he/she do not have any information." ] }, { "cell_type": "code", "execution_count": null, "id": "f37c0e6d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "078444a0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "6e85228e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "dd80dcff", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "c20c7fd0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "4dab5d98", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "d9e05576", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "e2505bd5", "metadata": {}, "source": [ "## 2. Identify the Columns having Null/Missing values\n", "- The **`df.isna()`** method is recommended to use than `df.isnull()`, which return a boolean same-sized object that indicates whether an element is NA value or not. Missing values get mapped to True. Everything else gets mapped to False values. Remember, characters such as empty strings ``''`` or `numpy.inf` are not considered NA values.\n", "- The **`df.notna()`** method is recommended to use than `df.notnull()` methods return a boolean same-sized object that indicates whether an element is NA value or not. Non-missing values get mapped to True. " ] }, { "cell_type": "code", "execution_count": null, "id": "c4f6a75c", "metadata": {}, "outputs": [], "source": [ "df.math.isna()" ] }, { "cell_type": "code", "execution_count": null, "id": "406d9230", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "d7d48904", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "f1776b52", "metadata": {}, "outputs": [], "source": [ "df.loc[df.math.isna(), :]" ] }, { "cell_type": "code", "execution_count": null, "id": "aa2ed31c", "metadata": {}, "outputs": [], "source": [ "df.isna().head()" ] }, { "cell_type": "code", "execution_count": null, "id": "af437830", "metadata": {}, "outputs": [], "source": [ "df.notna().head()" ] }, { "cell_type": "code", "execution_count": null, "id": "25b7c141", "metadata": {}, "outputs": [], "source": [ "# Now we can use sum() on this dataframe object of Boolean values (True is mapped to 1)\n", "df.isna().sum()" ] }, { "cell_type": "code", "execution_count": null, "id": "966c78ce", "metadata": {}, "outputs": [], "source": [ "# Similarly, we can use sum() on this dataframe object of Boolean values (True is is mapped to 1)\n", "df.notna().sum()" ] }, { "cell_type": "code", "execution_count": null, "id": "4c595a48", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "7202e219", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "1cea055f", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "45e02d50", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "14dd039f", "metadata": {}, "source": [ "## 3. Handle/Impute the Null/Missing Values under the `math` Column" ] }, { "cell_type": "markdown", "id": "190177e8", "metadata": {}, "source": [ "### a. Identify the Rows under the `math` Column having Null/Missing values\n", "- The `df.isna()` method works equally good on Series objects as well" ] }, { "cell_type": "code", "execution_count": null, "id": "b0853c04", "metadata": {}, "outputs": [], "source": [ "mask = df.math.isna()\n", "mask" ] }, { "cell_type": "code", "execution_count": null, "id": "88c4e19e", "metadata": {}, "outputs": [], "source": [ "df.loc[df.math.isna(), :]" ] }, { "cell_type": "code", "execution_count": null, "id": "72a967c6", "metadata": {}, "outputs": [], "source": [ "# This will return only those rows of dataframe having null values under the math column\n", "df[mask] # df[df.math.isna()]\n", "df.loc[mask, :] # df.loc[df.math.isna(), :]" ] }, { "cell_type": "code", "execution_count": null, "id": "e49c2c30", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "d48768b4", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "7bf12448", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "f5d527b7", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "552c7ce9", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "430a2bed", "metadata": {}, "source": [ "### b. Replace the Null/Missing Values under the `math` Column\n", "- After detecting the NaN values, the next question is, what value we should write in the cells where we have Null/Missing values under the `math` column\n", "- Suppose, we want to put the average values at the place of missing values." ] }, { "cell_type": "code", "execution_count": null, "id": "fec9c129", "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "code", "execution_count": null, "id": "7c8046ed", "metadata": {}, "outputs": [], "source": [ "# Compute the mean of math column\n", "df.math.mean() " ] }, { "cell_type": "code", "execution_count": null, "id": "7a08ec59", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "f9254c93", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "5940b60b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "ae5eca68", "metadata": {}, "source": [ "> By seeing the error, it appears that the `math` column do not have the `int64` or `float64` type. Let us check this out" ] }, { "cell_type": "code", "execution_count": null, "id": "d53fb2ad", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "67e345df", "metadata": {}, "outputs": [], "source": [ "# Check out the data type of math column\n", "df['math'].dtypes" ] }, { "cell_type": "code", "execution_count": null, "id": "e7f1898e", "metadata": {}, "outputs": [], "source": [ "# We can also use the `df.info()` method to display the count of Non-Null columns, their datatypes, their names \n", "# and memory usage of that dataframe.\n", "\n", "df.info()" ] }, { "cell_type": "markdown", "id": "e6affc95", "metadata": {}, "source": [ "- **What can be the reason for this?**\n", "- Let us check out the values under this column" ] }, { "cell_type": "code", "execution_count": null, "id": "739483d1", "metadata": {}, "outputs": [], "source": [ "df['math']" ] }, { "cell_type": "code", "execution_count": null, "id": "ce23aca2", "metadata": {}, "outputs": [], "source": [ "# We can replace all such values using the `replace()` method\n", "import numpy as np\n", "df['math'] = df.math.replace('No Idea', np.nan).head()" ] }, { "cell_type": "code", "execution_count": null, "id": "9bef252d", "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "code", "execution_count": null, "id": "1bc13f55", "metadata": {}, "outputs": [], "source": [ "df.math.replace('No Idea', np.nan, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "1b6e438c", "metadata": {}, "outputs": [], "source": [ "# Note the marks of Saadia in math are changed from string `No Idea` to `NaN`\n", "# Since this seems working fine let us make inplace=True to make these changes in the original dataframe\n", "df.replace('No Idea', np.nan, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "8a3a4ac7", "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "3c17ec0c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "af6fb4d9", "metadata": {}, "outputs": [], "source": [ "# Let us check the data type of math column\n", "df['math'].dtypes" ] }, { "cell_type": "code", "execution_count": null, "id": "67ec8d4e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "35ee44f3", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "c8008ee0", "metadata": {}, "outputs": [], "source": [ "# It is still Object, which is natural, however, we can change the datatype to `df.astype()` method\n", "df['math'] = df['math'].astype(float)" ] }, { "cell_type": "code", "execution_count": null, "id": "0ffeba3d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "1c16b303", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "6bf25352", "metadata": {}, "outputs": [], "source": [ "# Let us check the data type of math column\n", "df['math'].dtypes" ] }, { "cell_type": "code", "execution_count": null, "id": "39cb34ae", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "de2510bd", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "315e10e5", "metadata": {}, "outputs": [], "source": [ "# Let us compute the average of math marks again \n", "df.math.mean() " ] }, { "cell_type": "code", "execution_count": null, "id": "19b5ba1d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "7df22e86", "metadata": {}, "outputs": [], "source": [ "# List only those records under math column having Null values\n", "mask = df.math.isna()\n", "df.loc[mask, 'math']" ] }, { "cell_type": "code", "execution_count": null, "id": "849e288e", "metadata": {}, "outputs": [], "source": [ "df.math.mean()" ] }, { "cell_type": "code", "execution_count": null, "id": "c21e0ffd", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "01860b3a", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "64df7693", "metadata": {}, "outputs": [], "source": [ "# Let us replace these values with mean value of the math column\n", "df.loc[(df.math.isna()),'math'] = df.math.mean()" ] }, { "cell_type": "code", "execution_count": null, "id": "de8abade", "metadata": {}, "outputs": [], "source": [ "# Confirm the result\n", "df.isna().sum()\n", "#df.info()" ] }, { "cell_type": "code", "execution_count": null, "id": "d998cc44", "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "2232212e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "87e56a44", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "abd8d783", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "80408f68", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "e3819ac9", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "5ac35433", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "70ac58bc", "metadata": {}, "source": [ "## 4. Handle/Impute the Null/Missing Values under the `group` Column\n", "- The `group` column contains categorical values, i.e., a value that can take on one of a limited, and usually fixed, number of possible values." ] }, { "cell_type": "markdown", "id": "d9ebb9ae", "metadata": {}, "source": [ "### a. Identify the Rows under the `group` Column having Null/Missing values" ] }, { "cell_type": "code", "execution_count": null, "id": "0bf14128", "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "code", "execution_count": null, "id": "b5e93928", "metadata": {}, "outputs": [], "source": [ "mask = df.group.isna()\n", "mask" ] }, { "cell_type": "code", "execution_count": null, "id": "f7438404", "metadata": {}, "outputs": [], "source": [ "df[mask] # df[df.group.isna()]\n", "df.loc[mask, :] # df.loc[df.group.isna()]" ] }, { "cell_type": "code", "execution_count": null, "id": "7fbb9864", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "d78b2e0f", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "6cbf038d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "5b757b2f", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "570d6f9e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "13ab25aa", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "f638dd05", "metadata": {}, "source": [ "### b. Replace the Null/Missing Values under the `group` Column\n", "- After detecting the NaN values, the next question is, what value we should write in the cells where we have Null/Missing values\n", "- Since this is a categorical column having datatype object (group A, group B, group C, ...), so let us replace it with th value inside the column having the maximum frequency" ] }, { "cell_type": "code", "execution_count": null, "id": "5bfb3a36", "metadata": {}, "outputs": [], "source": [ "# Use value_counts() function which return a Series containing counts of unique values (in descending order)\n", "# with the most frequently-occurring element at first. It excludes NA values by default.\n", "df.group.value_counts()" ] }, { "cell_type": "code", "execution_count": null, "id": "b2c21ca5", "metadata": {}, "outputs": [], "source": [ "# Another way of doing is use the mode() function on the column\n", "df.group.mode() " ] }, { "cell_type": "code", "execution_count": null, "id": "66acc161", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "b65a6998", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "cd40e2f0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "e2f362b6", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "b1c578a7", "metadata": {}, "outputs": [], "source": [ "# List only those records under group column having Null values\n", "mask = df.group.isna()\n", "df.loc[mask, 'group'] # df.loc[(df.group.isna()), 'group']" ] }, { "cell_type": "code", "execution_count": null, "id": "6396dd02", "metadata": {}, "outputs": [], "source": [ "df.loc[(df.group.isna()),'group']" ] }, { "cell_type": "code", "execution_count": null, "id": "2758fefd", "metadata": {}, "outputs": [], "source": [ "# Let us replace these values with maximum occurring value in the `group` column\n", "df.loc[(df.group.isna()),'group'] = 'group C'" ] }, { "cell_type": "code", "execution_count": null, "id": "4ef7df81", "metadata": {}, "outputs": [], "source": [ "# Confirm the result\n", "df.isna().sum()\n", "#df.info()" ] }, { "cell_type": "code", "execution_count": null, "id": "69298cda", "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "markdown", "id": "65193323", "metadata": {}, "source": [ ">Note that in the original dataframe Arifa group information was missing, and now it is `group C` " ] }, { "cell_type": "code", "execution_count": null, "id": "04fc667a", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "348ea0cb", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "fcf6cda5", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "5cda8f0e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "b20a19ef", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "7c954256", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "406a2755", "metadata": {}, "source": [ "## 5. Handle Missing values under a Numeric/Categorical Column using `fillna()`" ] }, { "cell_type": "markdown", "id": "697b16b8", "metadata": {}, "source": [ "### a. Replace the Null/Missing Values under the math Column using `fillna()`\n", "- This is more recommended way of filling in the Null values within columns of your dataset rather than the use of the `loc` method.\n", "```\n", "object.fillna(value, method, inplace=True)\n", "```\n", "- The only required argument is either the `value`, with which we want to replace the missing values OR the `method` to be used to replace the missing values\n", "- Returns object with missing values filled or None if ``inplace=True``" ] }, { "cell_type": "code", "execution_count": null, "id": "fe077e43", "metadata": {}, "outputs": [], "source": [ "# Let us read the dataset again with NA values under math column\n", "import pandas as pd\n", "df = pd.read_csv('datasets/group-marks.csv')" ] }, { "cell_type": "code", "execution_count": null, "id": "f5e4bb09", "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "markdown", "id": "74f2206c", "metadata": {}, "source": [ ">- Before proceeding, let us this time handle the string value `No Idea` under the math column while reading the csv file, instead of doing afterwards in the dataframe using the `replace()` method as we have done above.\n", ">- For this we will use the `na_values` argument to the `pd.read_csv()` method, to which you can pass a single value or a list of values to be replaced with NaN" ] }, { "cell_type": "code", "execution_count": null, "id": "1b02a535", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('datasets/group-marks.csv', na_values='No Idea')" ] }, { "cell_type": "code", "execution_count": null, "id": "cc3fd29e", "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "e40e34c5", "metadata": {}, "outputs": [], "source": [ "df.isna().sum()" ] }, { "cell_type": "code", "execution_count": null, "id": "f2de4792", "metadata": {}, "outputs": [], "source": [ "df.loc[df.math.isna()]" ] }, { "cell_type": "code", "execution_count": null, "id": "b00977e2", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "ed01e7a0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "8007a41f", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "dd8541f8", "metadata": {}, "outputs": [], "source": [ "# This time instead of loc, use fillna() method with just two arguments\n", "# inplace=True parameter ensure that this happens in the original dataframe\n", "\n", "df.math.fillna(value=df.math.mean(), inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "60298853", "metadata": {}, "outputs": [], "source": [ "# Confirm the result\n", "df.isna().sum()\n", "#df.info()" ] }, { "cell_type": "code", "execution_count": null, "id": "daeac35d", "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "d1e84e0d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "c54c3d39", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "0e5204bd", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "e0d6f5d9", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "926ec32f", "metadata": {}, "source": [ "### b. Replace the Null/Missing Values under the `group` Column using `fillna()`" ] }, { "cell_type": "code", "execution_count": null, "id": "ad051b97", "metadata": {}, "outputs": [], "source": [ "# Let us read the dataset again with NA values\n", "import pandas as pd\n", "df = pd.read_csv('datasets/group-marks.csv', na_values='No Idea')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "d634b9d3", "metadata": {}, "outputs": [], "source": [ "df.isna().sum()" ] }, { "cell_type": "code", "execution_count": null, "id": "826d9d2d", "metadata": {}, "outputs": [], "source": [ "# Once again instead of loc,let us use fillna() method with just two arguments\n", "\n", "df.group.fillna('group C', inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "b70222f2", "metadata": {}, "outputs": [], "source": [ "# Confirm the result\n", "df.isna().sum()\n", "#df.info()" ] }, { "cell_type": "code", "execution_count": null, "id": "e803f22b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "8d032cb3", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "06b34c06", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "04030de2", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "0de0de66", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "58bf2146", "metadata": {}, "outputs": [], "source": [ "# Let us fill the math, english and scholarship columns as well again\n", "df.math.fillna(df.math.mean(), inplace=True)\n", "df.english.fillna(df.english.mean(), inplace=True)\n", "df.scholarship.fillna(df.scholarship.mean(), inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "9c214fda", "metadata": {}, "outputs": [], "source": [ "# Confirm the result\n", "df.isna().sum()\n" ] }, { "cell_type": "code", "execution_count": null, "id": "2f66a04e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "ab2d2d48", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "ef5478ef", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "ca660834", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "729361a1", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "5e88d9dc", "metadata": {}, "source": [ "### c. Replace the Null/Missing Values under the` math` and `group` Column using `ffill` and `bfill` Arguments\n", "- In above examples, we have used the mean value in case of numeric column and mode value in case of a categorical column as the filling value to the `fillna()` method\n", "```\n", "object.fillna(value, method, inplace=True)\n", "```\n", "\n", "- We can pass `ffill` or `bfill` as method argument to the `ffillna()` method. This will replace the null values with other values from the DataFrame\n", "- `ffill` (Forward fill): It fills the NaN value with the previous value\n", "- `bfill` (Back fill): It fills the NaN value with the Next/Upcoming value\n", "\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "id": "c42ec42e", "metadata": {}, "outputs": [], "source": [ "# Let us read the dataset again with NA values\n", "import pandas as pd\n", "df = pd.read_csv('datasets/group-marks.csv', na_values='No Idea')\n", "df.head(20)" ] }, { "cell_type": "code", "execution_count": null, "id": "1a2edccc", "metadata": {}, "outputs": [], "source": [ "df.isna().sum()" ] }, { "cell_type": "code", "execution_count": null, "id": "1aa32a02", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "b5f4d161", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "edab7c0d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "6967c27d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "1b0748ec", "metadata": {}, "outputs": [], "source": [ "# forward fill or ffill attribute\n", "# If have NaN value, just carry forward the previous value\n", "# using ffill attribute, you can fill the NaN value with the previous value in that column\n", "df.fillna(method = 'ffill', inplace=True)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "29d96cac", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "dc337636", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "c0558882", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "d49023d8", "metadata": {}, "source": [ ">Is it working fine?" ] }, { "cell_type": "code", "execution_count": null, "id": "1a4a6277", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "f4efd6f2", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "84d8cd2d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "3b085425", "metadata": {}, "outputs": [], "source": [ "df.fillna(method = 'bfill', inplace=True)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "dca48c0f", "metadata": {}, "outputs": [], "source": [ "# Confirm the result\n", "df.isna().sum()" ] }, { "cell_type": "code", "execution_count": null, "id": "26d21ede", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "521e390e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "788f3da7", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "89d68fa4", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "6cf5e126", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "77a90898", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "0744d3ec", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "aaa38f45", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "d55c6345", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "849f595d", "metadata": {}, "source": [ "## 6. Handle Repeating Values (for same information) under the `session` Column\n", "- If you observe the values under the `session` column, you can observe that it is a categorical column containing six different categories (as values).\n", " - Notice that the categories `MORNING` and `MOR` are same\n", " - Similarly, `AFTERNOON` and `AFT` are same\n", " - Similarly, `EVENING` and `EVE` are same\n", "- This happens when you have collected data from different sources, where same information is written in different ways\n", "- So the `session` column has six different categories (as values) but should have only three" ] }, { "cell_type": "code", "execution_count": null, "id": "39378b47", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "df = pd.read_csv('datasets/group-marks.csv' )\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "6a11aa55", "metadata": {}, "outputs": [], "source": [ "df.session" ] }, { "cell_type": "code", "execution_count": null, "id": "6266ae48", "metadata": {}, "outputs": [], "source": [ "# Let use check out the counts of unique values inside the session Column\n", "df.session.value_counts()" ] }, { "cell_type": "code", "execution_count": null, "id": "4bebb2c6", "metadata": {}, "outputs": [], "source": [ "df.session.unique()" ] }, { "cell_type": "code", "execution_count": null, "id": "68e20639", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "e160c362", "metadata": {}, "source": [ "### Handle the Repeating Values under the session Column using `map()`\n", "- To keep the data clean we will map all these values to only three categories to `MOR` , `AFT` and `EVE` using the map() function.\n", "```\n", "df.map(mapping, na_action=None)\n", "```\n", "- The `map()` method is used for substituting each value in a Series with another value, that may be derived from a `dict`. The `map()` method returns a series after performing the mapping\n", "- You can give `ignore` as second argument which will propagate NaN values, without passing them to the mapping correspondence." ] }, { "cell_type": "code", "execution_count": null, "id": "a258c9b8", "metadata": {}, "outputs": [], "source": [ "# To do this, let us create a new mapping (dictionary) \n", "dict1 = {\n", " 'MORNING' : 'MOR',\n", " 'MOR' : 'MOR',\n", " 'AFTERNOON' : 'AFT',\n", " 'AFT': 'AFT',\n", " 'EVENING' : 'EVE',\n", " 'EVE': 'EVE'\n", "}" ] }, { "cell_type": "code", "execution_count": null, "id": "4d85f45f", "metadata": {}, "outputs": [], "source": [ "# It returns a series with the same index as caller, the original series remains unchanged. \n", "# So we have assigned the resulting series to `df.session` series\n", "df.session.map(dict1)" ] }, { "cell_type": "code", "execution_count": null, "id": "bd8c5c77", "metadata": {}, "outputs": [], "source": [ "df.session = df.session.map(dict1)" ] }, { "cell_type": "code", "execution_count": null, "id": "c20810ca", "metadata": {}, "outputs": [], "source": [ "# Count of new categories in the column session\n", "# Observe we have managed to properly manage the values inside the session column\n", "df.session.value_counts()" ] }, { "cell_type": "code", "execution_count": null, "id": "614f17db", "metadata": {}, "outputs": [], "source": [ "# Let us verify the result\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "2fe79ea0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "2b1ef03b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "92af0ed0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "ed4a973d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "3f7cc250", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "2b07d53e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "f0d90de5", "metadata": {}, "source": [ "## 7. Create a new Column by Modifying an Existing Column\n", "- We have a column scholarship in the dataset, which is in Pak Rupees\n", "- Suppose you want to have a new column which should represent the scholarship in US Dollars\n", "- For that we need to add a new column by dividing each value of scholarship with 150" ] }, { "cell_type": "code", "execution_count": null, "id": "8c3133b5", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "df = pd.read_csv('datasets/group-marks.csv' )\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "a164db22", "metadata": {}, "outputs": [], "source": [ "df.scholarship.apply(lambda x: x/170)" ] }, { "cell_type": "code", "execution_count": null, "id": "b0df8d4b", "metadata": {}, "outputs": [], "source": [ "df['Scholarship_in_$'] = df.scholarship.apply(lambda x : x/150)" ] }, { "cell_type": "code", "execution_count": null, "id": "e579102d", "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "6c9328f6", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "85c26466", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "6dc33e00", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "3864967f", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "33eb8118", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "88eea76d", "metadata": {}, "source": [ "## 8. Delete Rows Having NaN values using `df.dropna()` method\n", "**```\n", "df.dropna(axis, how, subset, inplace)\n", "```**" ] }, { "cell_type": "code", "execution_count": null, "id": "ef83e54e", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "df = pd.read_csv('datasets/group-marks.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "b226513f", "metadata": {}, "outputs": [], "source": [ "df.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "ddac322b", "metadata": {}, "outputs": [], "source": [ "# You can use dropna() method to drop all the rows, it it has any na value\n", "df1 = df.dropna()\n", "df1.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "23e91e38", "metadata": {}, "outputs": [], "source": [ "df1.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "1b5d7f58", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "3b1c5d46", "metadata": {}, "outputs": [], "source": [ "# Default Arguments to dropna()\n", "df2 = df.dropna(axis=0, how='any')\n", "df2.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "5776a24c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "ae929bfd", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "33388c27", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "37de9061", "metadata": {}, "outputs": [], "source": [ "# If we set how='all` it means drop a row only if all of its values are NA\n", "df2 = df.dropna(axis=0, how='all')\n", "df2.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "cb454e9c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "dd313b95", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "d3e33944", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "3e9cc0b3", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "4a45c8b7", "metadata": {}, "outputs": [], "source": [ "# Use of subset argument and pass it a list of columns based on whose values you want to drop a row\n", "df2 = df.dropna(axis=0, how='any', subset=['math'])\n", "df2.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "49dc8ff7", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "b40e73a8", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "98ea1093", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "c409a382", "metadata": {}, "outputs": [], "source": [ "# Use of subset argument\n", "df2 = df.dropna(axis=0, how='any', subset=['session'])\n", "df2.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "59a3d051", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "6d6dbf74", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "81cf8668", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "24afcf1c", "metadata": {}, "outputs": [], "source": [ "# Having `how=all` and `subset=listofcolumnnames`, then it will \n", "# drop a row only if both the columns have a NA value in that row\n", "df2 = df.dropna(axis=0, how='any', subset=['math', 'session'])\n", "df2.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "973223ef", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "2c043826", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "ebedf621", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "718b79a9", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "f13d514a", "metadata": {}, "outputs": [], "source": [ "# If we set the axis=1 and how=all, it means drop a column if all the values under it is na\n", "df2 = df.dropna(axis=1, how='all')\n", "df2.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "ec9262c2", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "377e62b9", "metadata": {}, "outputs": [], "source": [ "df.info()" ] }, { "cell_type": "code", "execution_count": null, "id": "8941aec9", "metadata": {}, "outputs": [], "source": [ "df2 = df.dropna(axis=1, how='any')\n", "df2.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "b104da0e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "c779dfff", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "b319edee", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "ee863cea", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "94a5c257", "metadata": {}, "outputs": [], "source": [ "# If we set the axis=1 and how=any, it means drop a column if any value under it is na\n", "df2 = df.dropna(axis=1, how='any')\n", "df2.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "cf1c032c", "metadata": {}, "outputs": [], "source": [ "df2.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "5bc00056", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "ae31c28f", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "864c9378", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "f4e38815", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "24231c5e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "d527fdbe", "metadata": {}, "source": [ "## 9. Convert Categorical Variables into Numerical\n", "- Most of the machine learning algorithms do not take categorical variables so we need to convert them into numerical ones. \n", "- We can do this using Pandas function `pd.get_dummies()`, which will create a binary column for each of the categories. \n", "```\n", "pd.get_dummies(data, drop_first=False)\n", "```\n", "- Where, the only required argument is `data` which can be a dataframe or a series\n", "- The parameter drop_first : bool, default False Whether to get k-1 dummies out of k categorical levels by removing the first level.\n", "\n", "**Note:** Making a dummy variable will take all the `K` distinct values in one coumn and make `K` columns out of them" ] }, { "cell_type": "markdown", "id": "c52c50a3", "metadata": {}, "source": [ "### a. Convert all categorical variables into dummy/indicator variables" ] }, { "cell_type": "code", "execution_count": null, "id": "36e83961", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "df = pd.read_csv('datasets/group-marks.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "ddb9d493", "metadata": {}, "outputs": [], "source": [ "# currently we have 10 columns in the data\n", "df.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "70d3c674", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "c081ee33", "metadata": {}, "outputs": [], "source": [ "# Convert all categorical variables into dummy/indicator variables\n", "df = pd.get_dummies(df)" ] }, { "cell_type": "code", "execution_count": null, "id": "23c9ae78", "metadata": {}, "outputs": [], "source": [ "# Let us view the datafreame, keep a note on the number of columns\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "0ed4ca5c", "metadata": { "scrolled": true }, "outputs": [], "source": [ "# The Number of columns has gone to 1605 now\n", "df.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "9d766891", "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "id": "eb94e433", "metadata": {}, "source": [ "- So we have 112 columns\n", "- It adds a lot of dimensionality to your data, i.e., increase the number of columns\n", "- It also become difficult to deal with that much number of columns\n", "- This is a trade-off, which is handled by technique called dimensionality reduction" ] }, { "cell_type": "code", "execution_count": null, "id": "a956c709", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "837e3f52", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "3c8064be", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "bf725cc0", "metadata": {}, "source": [ "### b. Perform One-Hot Encoding for Categorical Column `gender` Only\n", "- In our dataframe, the gender column is a categorical column having two values 'male' and 'female'\n", "- It will create a dummy binary columns. \n", "- This is also known as `One Hot Encoding`. You will learn more encoding techniques in the data pre-processing module.\n" ] }, { "cell_type": "code", "execution_count": 15, "id": "c2f41e74", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rollnonamegendergroupsessionagescholarshipmathenglishurdu
0MS01SAADIAfemalegroup BMORNING282562No Idea72.074
1MS02JUMAIMAfemalegroup CAFTERNOON3328006990.088
2MS03ARIFAfemaleNaNEVENING343500NaN95.093
3MS04SAADIAfemalegroup AMOR4420004757.044
4MS05DANISHmalegroup CAFTERNOON5421007678.055
\n", "
" ], "text/plain": [ " rollno name gender group session age scholarship math \\\n", "0 MS01 SAADIA female group B MORNING 28 2562 No Idea \n", "1 MS02 JUMAIMA female group C AFTERNOON 33 2800 69 \n", "2 MS03 ARIFA female NaN EVENING 34 3500 NaN \n", "3 MS04 SAADIA female group A MOR 44 2000 47 \n", "4 MS05 DANISH male group C AFTERNOON 54 2100 76 \n", "\n", " english urdu \n", "0 72.0 74 \n", "1 90.0 88 \n", "2 95.0 93 \n", "3 57.0 44 \n", "4 78.0 55 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "df1 = pd.read_csv('datasets/group-marks.csv')\n", "df1.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "cb554eba", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "058940b3", "metadata": {}, "outputs": [], "source": [ "# Convert only gender variable into dummy/indicator variables\n", "df2 = pd.get_dummies(df1[['gender']])\n", "df2.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "8a111866", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "9d07731b", "metadata": {}, "outputs": [], "source": [ "# Since we donot need two separate columns, so simply use the `drop_first` argument of get_dummies to handle this\n", "df2 = pd.get_dummies(df1[['gender']], drop_first=True)\n", "df2.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "48e7a40e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "5633aaec", "metadata": {}, "outputs": [], "source": [ "# We will talk about join in the next session in detail.\n", "df3 = df1.join(df2['gender_male'])\n", "df3.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "17c7f244", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "218df74c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "bee19d23", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "8f25950c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 35, "id": "684dbde3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rollnonamegendergroupsessionagescholarshipmathenglishurdu
0MS01SAADIAfemalegroup BMORNING282562No Idea72.074
1MS02JUMAIMAfemalegroup CAFTERNOON3328006990.088
2MS03ARIFAfemaleNaNEVENING343500NaN95.093
3MS04SAADIAfemalegroup AMOR4420004757.044
4MS05DANISHmalegroup CAFTERNOON5421007678.055
\n", "
" ], "text/plain": [ " rollno name gender group session age scholarship math \\\n", "0 MS01 SAADIA female group B MORNING 28 2562 No Idea \n", "1 MS02 JUMAIMA female group C AFTERNOON 33 2800 69 \n", "2 MS03 ARIFA female NaN EVENING 34 3500 NaN \n", "3 MS04 SAADIA female group A MOR 44 2000 47 \n", "4 MS05 DANISH male group C AFTERNOON 54 2100 76 \n", "\n", " english urdu \n", "0 72.0 74 \n", "1 90.0 88 \n", "2 95.0 93 \n", "3 57.0 44 \n", "4 78.0 55 " ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "df = pd.read_csv('datasets/group-marks.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 36, "id": "dede203f", "metadata": {}, "outputs": [], "source": [ "df.session.value_counts()\n", "dict1 = {\n", " 'MORNING' : 'MOR',\n", " 'AFTERNOON' : 'AFT',\n", "}\n", "#df.session = df.session.map(dict1)\n", "df.session = df.session.replace(dict1)\n", "\n" ] }, { "cell_type": "code", "execution_count": 37, "id": "a5aa5677", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rollnonamegendergroupsessionagescholarshipmathenglishurdu
0MS01SAADIAfemalegroup BMOR282562No Idea72.074
1MS02JUMAIMAfemalegroup CAFT3328006990.088
2MS03ARIFAfemaleNaNEVENING343500NaN95.093
3MS04SAADIAfemalegroup AMOR4420004757.044
4MS05DANISHmalegroup CAFT5421007678.055
5MS06SAFIAfemalegroup BAFT233800NaN83.078
6MS07SARAfemalegroup BEVENING4730008895.092
7MS08ABDULLAHmalegroup BEVE3320004043.039
8MS09KHANmalegroup DMOR27250064NaN67
9MS10HASEENAfemalegroup BAFT3328003860.050
10MS11MUSTJABmalegroup CMOR4630005854.052
11MS12ABRARmalegroup DMOR5333124052.043
12MS13MAHOORfemaleNaNMOR2523456581.073
13MS14USAMAmalegroup AAFT2626547872.070
14MS15NAVAIRAfemalegroup AAFT2521375053.058
15MS16SAWAIRAfemalegroup CEVENING2925676975.078
16MS17NOFILmalegroup CMOR2235008889.086
17MS18SHUMAILAfemalegroup BAFT31250018NaN28
18MS19ABUBAKARmalegroup CAFT3330004642.046
19MS20NADRAfemalegroup CMOR3622005458.045
20MS21HASEEBmalegroup DAFT3435006669.063
21MS22HOORIAfemalegroup BEVENING3220006575.070
22MS23RAUFmalegroup DMOR3125004454.053
23MS24LAIBAfemalegroup CAFT373000NaN73.073
24MS25FATIMAmalegroup DAFT3325007471.080
25MS26IBRARmalegroup AEVENING3935007374.072
26MS27USAMAmalegroup BEVE3225006954.055
27MS28NOORfemalegroup CMOR4730006769.075
28MS29UMARmalegroup CAFT4440007070.065
29MS30ROSHANfemalegroup DMOR4235006270.075
30MS31UNAIZAfemalegroup DEVENING4125006974.074
31MS32FATIMAfemalegroup BEVE19300063NaN61
32MS33SHAISTAfemaleNaNMOR2935005672.065
33MS34ALImalegroup DEVENING3430004042.038
34MS35ABDULLAHmalegroup EMOR4525009787.082
35MS36OSAMAmalegroup EAFT3135008181.079
36MS37AMBREENfemalegroup DEVENING2830007481.062
37MS38NASEEMfemalegroup DMOR2625005064.059
38MS39SAFIAfemalegroup DAFT3635007590.088
39MS40NABEELmalegroup BAFT3330005756.057
40MS41SIKANDERmalegroup CEVENING3525005561.054
41MS42ZAINfemalegroup CEVE2735005873.068
42MS43AYESHAfemalegroup BMOR3830005358.065
43MS44HASSANmalegroup BAFT3125005965.066
44MS45ZAINABfemalegroup EMOR283500NaN56.054
45MS46TALHAmalegroup BEVENING4130006554.057
46MS47SABAfemalegroup AEVE3625005565.062
47MS48RAFIAfemalegroup CMOR3035006671.076
48MS49FATIMAfemalegroup DMOR4025005774.076
49MS50KAKAMANNAmalegroup CAFT3730006678.081
\n", "
" ], "text/plain": [ " rollno name gender group session age scholarship math \\\n", "0 MS01 SAADIA female group B MOR 28 2562 No Idea \n", "1 MS02 JUMAIMA female group C AFT 33 2800 69 \n", "2 MS03 ARIFA female NaN EVENING 34 3500 NaN \n", "3 MS04 SAADIA female group A MOR 44 2000 47 \n", "4 MS05 DANISH male group C AFT 54 2100 76 \n", "5 MS06 SAFIA female group B AFT 23 3800 NaN \n", "6 MS07 SARA female group B EVENING 47 3000 88 \n", "7 MS08 ABDULLAH male group B EVE 33 2000 40 \n", "8 MS09 KHAN male group D MOR 27 2500 64 \n", "9 MS10 HASEENA female group B AFT 33 2800 38 \n", "10 MS11 MUSTJAB male group C MOR 46 3000 58 \n", "11 MS12 ABRAR male group D MOR 53 3312 40 \n", "12 MS13 MAHOOR female NaN MOR 25 2345 65 \n", "13 MS14 USAMA male group A AFT 26 2654 78 \n", "14 MS15 NAVAIRA female group A AFT 25 2137 50 \n", "15 MS16 SAWAIRA female group C EVENING 29 2567 69 \n", "16 MS17 NOFIL male group C MOR 22 3500 88 \n", "17 MS18 SHUMAILA female group B AFT 31 2500 18 \n", "18 MS19 ABUBAKAR male group C AFT 33 3000 46 \n", "19 MS20 NADRA female group C MOR 36 2200 54 \n", "20 MS21 HASEEB male group D AFT 34 3500 66 \n", "21 MS22 HOORIA female group B EVENING 32 2000 65 \n", "22 MS23 RAUF male group D MOR 31 2500 44 \n", "23 MS24 LAIBA female group C AFT 37 3000 NaN \n", "24 MS25 FATIMA male group D AFT 33 2500 74 \n", "25 MS26 IBRAR male group A EVENING 39 3500 73 \n", "26 MS27 USAMA male group B EVE 32 2500 69 \n", "27 MS28 NOOR female group C MOR 47 3000 67 \n", "28 MS29 UMAR male group C AFT 44 4000 70 \n", "29 MS30 ROSHAN female group D MOR 42 3500 62 \n", "30 MS31 UNAIZA female group D EVENING 41 2500 69 \n", "31 MS32 FATIMA female group B EVE 19 3000 63 \n", "32 MS33 SHAISTA female NaN MOR 29 3500 56 \n", "33 MS34 ALI male group D EVENING 34 3000 40 \n", "34 MS35 ABDULLAH male group E MOR 45 2500 97 \n", "35 MS36 OSAMA male group E AFT 31 3500 81 \n", "36 MS37 AMBREEN female group D EVENING 28 3000 74 \n", "37 MS38 NASEEM female group D MOR 26 2500 50 \n", "38 MS39 SAFIA female group D AFT 36 3500 75 \n", "39 MS40 NABEEL male group B AFT 33 3000 57 \n", "40 MS41 SIKANDER male group C EVENING 35 2500 55 \n", "41 MS42 ZAIN female group C EVE 27 3500 58 \n", "42 MS43 AYESHA female group B MOR 38 3000 53 \n", "43 MS44 HASSAN male group B AFT 31 2500 59 \n", "44 MS45 ZAINAB female group E MOR 28 3500 NaN \n", "45 MS46 TALHA male group B EVENING 41 3000 65 \n", "46 MS47 SABA female group A EVE 36 2500 55 \n", "47 MS48 RAFIA female group C MOR 30 3500 66 \n", "48 MS49 FATIMA female group D MOR 40 2500 57 \n", "49 MS50 KAKAMANNA male group C AFT 37 3000 66 \n", "\n", " english urdu \n", "0 72.0 74 \n", "1 90.0 88 \n", "2 95.0 93 \n", "3 57.0 44 \n", "4 78.0 55 \n", "5 83.0 78 \n", "6 95.0 92 \n", "7 43.0 39 \n", "8 NaN 67 \n", "9 60.0 50 \n", "10 54.0 52 \n", "11 52.0 43 \n", "12 81.0 73 \n", "13 72.0 70 \n", "14 53.0 58 \n", "15 75.0 78 \n", "16 89.0 86 \n", "17 NaN 28 \n", "18 42.0 46 \n", "19 58.0 45 \n", "20 69.0 63 \n", "21 75.0 70 \n", "22 54.0 53 \n", "23 73.0 73 \n", "24 71.0 80 \n", "25 74.0 72 \n", "26 54.0 55 \n", "27 69.0 75 \n", "28 70.0 65 \n", "29 70.0 75 \n", "30 74.0 74 \n", "31 NaN 61 \n", "32 72.0 65 \n", "33 42.0 38 \n", "34 87.0 82 \n", "35 81.0 79 \n", "36 81.0 62 \n", "37 64.0 59 \n", "38 90.0 88 \n", "39 56.0 57 \n", "40 61.0 54 \n", "41 73.0 68 \n", "42 58.0 65 \n", "43 65.0 66 \n", "44 56.0 54 \n", "45 54.0 57 \n", "46 65.0 62 \n", "47 71.0 76 \n", "48 74.0 76 \n", "49 78.0 81 " ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": null, "id": "4223b274", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "9341d7a6", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 24, "id": "ad4df941", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AFTAFTERNOONEVEEVENINGMORMORNING
0000001
1010000
2000100
3000010
4010000
\n", "
" ], "text/plain": [ " AFT AFTERNOON EVE EVENING MOR MORNING\n", "0 0 0 0 0 0 1\n", "1 0 1 0 0 0 0\n", "2 0 0 0 1 0 0\n", "3 0 0 0 0 1 0\n", "4 0 1 0 0 0 0" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.get_dummies(df.session)\n", "df1.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "26a9a4e2", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "f235f8c5", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "39a3b975", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 20, "id": "a84648f2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rollnonamegendergroupsessionagescholarshipmathenglishurdumale
0MS01SAADIAfemalegroup BMORNING282562No Idea72.0740
1MS02JUMAIMAfemalegroup CAFTERNOON3328006990.0880
2MS03ARIFAfemaleNaNEVENING343500NaN95.0930
3MS04SAADIAfemalegroup AMOR4420004757.0440
4MS05DANISHmalegroup CAFTERNOON5421007678.0551
\n", "
" ], "text/plain": [ " rollno name gender group session age scholarship math \\\n", "0 MS01 SAADIA female group B MORNING 28 2562 No Idea \n", "1 MS02 JUMAIMA female group C AFTERNOON 33 2800 69 \n", "2 MS03 ARIFA female NaN EVENING 34 3500 NaN \n", "3 MS04 SAADIA female group A MOR 44 2000 47 \n", "4 MS05 DANISH male group C AFTERNOON 54 2100 76 \n", "\n", " english urdu male \n", "0 72.0 74 0 \n", "1 90.0 88 0 \n", "2 95.0 93 0 \n", "3 57.0 44 0 \n", "4 78.0 55 1 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.get_dummies(df.gender, drop_first=True)\n", "df3 = df.join(df1['male'])\n", "df3.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "7035f8e1", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "1028a381", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 22, "id": "d0fbbba0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rollnonamegendergroupsessionagescholarshipmathenglishurdugender_male
0MS01SAADIAfemalegroup BMORNING282562No Idea72.0740
1MS02JUMAIMAfemalegroup CAFTERNOON3328006990.0880
2MS03ARIFAfemaleNaNEVENING343500NaN95.0930
3MS04SAADIAfemalegroup AMOR4420004757.0440
4MS05DANISHmalegroup CAFTERNOON5421007678.0551
\n", "
" ], "text/plain": [ " rollno name gender group session age scholarship math \\\n", "0 MS01 SAADIA female group B MORNING 28 2562 No Idea \n", "1 MS02 JUMAIMA female group C AFTERNOON 33 2800 69 \n", "2 MS03 ARIFA female NaN EVENING 34 3500 NaN \n", "3 MS04 SAADIA female group A MOR 44 2000 47 \n", "4 MS05 DANISH male group C AFTERNOON 54 2100 76 \n", "\n", " english urdu gender_male \n", "0 72.0 74 0 \n", "1 90.0 88 0 \n", "2 95.0 93 0 \n", "3 57.0 44 0 \n", "4 78.0 55 1 " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.get_dummies(df[['gender']], drop_first=True)\n", "df3 = df.join(df1['gender_male'])\n", "df3.head()" ] }, { "cell_type": "code", "execution_count": 3, "id": "1ef03d88", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rollnonamegendergroupsessionagescholarshipmathenglishurdugender_male
0MS01SAADIAfemalegroup BMORNING282562No Idea72.0740
1MS02JUMAIMAfemalegroup CAFTERNOON3328006990.0880
2MS03ARIFAfemaleNaNEVENING343500NaN95.0930
3MS04SAADIAfemalegroup AMOR4420004757.0440
4MS05DANISHmalegroup CAFTERNOON5421007678.0551
\n", "
" ], "text/plain": [ " rollno name gender group session age scholarship math \\\n", "0 MS01 SAADIA female group B MORNING 28 2562 No Idea \n", "1 MS02 JUMAIMA female group C AFTERNOON 33 2800 69 \n", "2 MS03 ARIFA female NaN EVENING 34 3500 NaN \n", "3 MS04 SAADIA female group A MOR 44 2000 47 \n", "4 MS05 DANISH male group C AFTERNOON 54 2100 76 \n", "\n", " english urdu gender_male \n", "0 72.0 74 0 \n", "1 90.0 88 0 \n", "2 95.0 93 0 \n", "3 57.0 44 0 \n", "4 78.0 55 1 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df3 = df.join(df1['gender_male'])\n", "df3.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "25cdfe07", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }