{ "cells": [ { "cell_type": "markdown", "id": "4a87b5ef", "metadata": {}, "source": [ "--- \n", " \n", "\n", "

Department of Data Science

\n", "

Course: Tools and Techniques for Data Science

\n", "\n", "---\n", "

Instructor: Muhammad Arif Butt, Ph.D.

" ] }, { "cell_type": "markdown", "id": "ab0dc25c", "metadata": {}, "source": [ "

Lecture 3.15 (Pandas-07)

" ] }, { "cell_type": "markdown", "id": "1f1a9282", "metadata": {}, "source": [ "\"Open" ] }, { "cell_type": "markdown", "id": "19f82705", "metadata": {}, "source": [ "\n", "\n", "## _Modifying Dataframes Part-II_" ] }, { "cell_type": "code", "execution_count": null, "id": "29077002", "metadata": {}, "outputs": [], "source": [ "# To install this library in Jupyter notebook\n", "#import sys\n", "#!{sys.executable} -m pip install pandas" ] }, { "cell_type": "code", "execution_count": null, "id": "9c0b2727", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "pd.__version__ , pd.__path__" ] }, { "cell_type": "markdown", "id": "0ec78c62", "metadata": {}, "source": [ "

A Dataframe is a two-dimensional labeled data structure with heterogeneously typed columns, having both a row and a column indices.

" ] }, { "cell_type": "markdown", "id": "12db95e1", "metadata": {}, "source": [ "\n", "\n", "## Learning agenda of this notebook\n", "- **Recap:**\n", " - Modifying Column names of Dataframe\n", " - Modifying Row indices of Dataframe\n", " - Modifying Data inside a Dataframe (Row-wise, Column-wise, Element-wise)\n", "\n", "\n", "1. Add a New Column in a Dataframe\n", "2. Delete an Existing Column from a Dataframe\n", "3. Add a New Row in a Dataframe\n", "4. Delete an Existing Row(s) from a Dataframe\n", "5. Adding a New Column with Conditional Values\n", "6. Deleting Row(s) Based on Specific Condition\n", "7. Delete a Column Based on Specific Condition\n", "8. Change Datatype of a Pandas Series\n", "9. Sorting dataframes using `df.sort_values()`\n", "10. Sorting dataframes using `df.sort_index()`" ] }, { "cell_type": "markdown", "id": "73464b5a", "metadata": {}, "source": [ "## Read a Sample Dataframe" ] }, { "cell_type": "code", "execution_count": null, "id": "6914759c", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "df = pd.read_csv('datasets/groupdata.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "2bcd74a3", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "f48af20d", "metadata": {}, "outputs": [], "source": [ "# `shape` attribute of a dataframe object return a two value tuple containing rows and columns\n", "# Note the rows count does not include the column labels and column count does not include the row index\n", "df.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "8c00cc7d", "metadata": {}, "outputs": [], "source": [ "# `index` attribute of a dataframe object return the list of row indices and its datatype\n", "df.index" ] }, { "cell_type": "code", "execution_count": null, "id": "b8a7fc1b", "metadata": {}, "outputs": [], "source": [ "# `columns` attribute of a dataframe object return the list of column labels and its datatype\n", "df.columns" ] }, { "cell_type": "code", "execution_count": null, "id": "899d087f", "metadata": {}, "outputs": [], "source": [ "# `dtypes` attribute of a dataframe object return the data type of each column in the dataframe\n", "df.dtypes" ] }, { "cell_type": "code", "execution_count": null, "id": "b45b8a1b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "0f1b5c02", "metadata": {}, "source": [ "## 1. Add a New Column in a Dataframe\n", "- To add a new column in a dataframe, create an appropriate series and then assign it to the dataframe\n", "- Every time a new series is added to a dataframe, its name automatically becomes an attribute of that dataframe.\n", "- It can be a series created from scratch, which can be cumbersome if the dataframe has thousands of rows.\n", "- Another common way to add a column is construct a series from the existing data within the dataframe\n", "- Let us understand this with an example" ] }, { "cell_type": "code", "execution_count": null, "id": "c298a8aa", "metadata": {}, "outputs": [], "source": [ "df.columns" ] }, { "cell_type": "code", "execution_count": null, "id": "fc5f8f87", "metadata": {}, "outputs": [], "source": [ "df.subj1 + df.subj2" ] }, { "cell_type": "code", "execution_count": null, "id": "c7bd9466", "metadata": {}, "outputs": [], "source": [ "df.subj1.add(df.subj2)" ] }, { "cell_type": "code", "execution_count": null, "id": "94e5cc04", "metadata": {}, "outputs": [], "source": [ "ser1 = df.subj1.add(df.subj2, fill_value=0)\n", "ser1" ] }, { "cell_type": "code", "execution_count": null, "id": "e44e16a6", "metadata": {}, "outputs": [], "source": [ "# On the left side of assignment you must use `[]` operator, while on the right you can use dot operator as well\n", "df['total'] = ser1" ] }, { "cell_type": "markdown", "id": "9e06f90b", "metadata": {}, "source": [ "Note that once, nothing appears to happen after you execute a Jupyter notebook cell, that means some processing has been done in the background. Over here, a new column has been added to the dataframe named df. Let us confirm this" ] }, { "cell_type": "code", "execution_count": null, "id": "edbe41d9", "metadata": {}, "outputs": [], "source": [ "df.head(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "061a339b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "a65c85c3", "metadata": {}, "source": [ "## 2. Delete an Existing Column from a Dataframe\n", "- You can use any of the following ways to delete a column from a dataframe:\n", " - Use `del df['colname']`, which will remove the column, but will not return it\n", " - Use `df.pop('colname')` method which will remove that column as well as return the deleted column as a series\n", " - Use `df.drop()` is a better method than the above two. It can delete more than one columns and is not inplace. Moreover, it can be used to delete rows as well" ] }, { "cell_type": "code", "execution_count": null, "id": "916264d4", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "3d4cbb78", "metadata": {}, "source": [ "### a. Option 1: Using `del df['colname']`\n", "- The `del df['colname']` will remove the column without returning it. It is inplace" ] }, { "cell_type": "code", "execution_count": null, "id": "0b1f41e6", "metadata": {}, "outputs": [], "source": [ "df.columns" ] }, { "cell_type": "code", "execution_count": null, "id": "c1c82f89", "metadata": {}, "outputs": [], "source": [ "del df['total']" ] }, { "cell_type": "code", "execution_count": null, "id": "9ee94a43", "metadata": {}, "outputs": [], "source": [ "df.head(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "f947930c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "837296e5", "metadata": {}, "source": [ "### b. Option 2: Using `df.pop('colname')`\n", "- The `df.pop('colname')` method will remove the column and will return the Series that has been removed from the dataframe. It is inplace" ] }, { "cell_type": "code", "execution_count": null, "id": "6370aee5", "metadata": {}, "outputs": [], "source": [ "df.pop('address')" ] }, { "cell_type": "code", "execution_count": null, "id": "a8adebb4", "metadata": {}, "outputs": [], "source": [ "df.head(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "bd380b44", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "4fba9714", "metadata": {}, "source": [ "### c. Option 3: Using `df.drop()`\n", "- The `df.drop()` method is used to remove one or more columns and will return a Series or Dataframe object accordingly.\n", "\n", "```\n", "df.drop(columns=[---], axis=1, inplace=False)\n", "```\n", "- If you want to drop more than one columns, pass the names of columns to be deleted as a Python List to the `columns` parameter and to the `axis` argument pass 1. (`axis` argument specifies the direction of operation, which is left to right while deleting columns)\n", "- By default it is not inplace. Most of Pandas methods that return a dataframe has an inplace paremeter with default value set to False. It means the operation will not effect the underlying change" ] }, { "cell_type": "code", "execution_count": null, "id": "721eb597", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "df = pd.read_csv('datasets/groupdata.csv')\n", "df.head(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "b1d18c63", "metadata": {}, "outputs": [], "source": [ "df.drop(columns='name')" ] }, { "cell_type": "code", "execution_count": null, "id": "0edec89b", "metadata": {}, "outputs": [], "source": [ "# Remember axis is the direction of operation, and axis=1 is the column axis that goes from left to right\n", "df.drop(columns=['name', 'age', 'address'], axis=1)" ] }, { "cell_type": "markdown", "id": "1e1baae4", "metadata": {}, "source": [ "It has just returned the resulting dataframe after removing the columns. No change has made to the original dataframe" ] }, { "cell_type": "code", "execution_count": null, "id": "f24346cd", "metadata": {}, "outputs": [], "source": [ "df.head(3)" ] }, { "cell_type": "markdown", "id": "5088f1bd", "metadata": {}, "source": [ "Let us repeat the operation, with `inplace=True`. Note this time it will return None. However, changes has been made to the original dataframe " ] }, { "cell_type": "code", "execution_count": null, "id": "6fcc8100", "metadata": {}, "outputs": [], "source": [ "df.drop(columns=['age', 'address', 'name'], axis=1, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "f4d51104", "metadata": {}, "outputs": [], "source": [ "df.head(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "bf66e759", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "6f15691b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "67be32b8", "metadata": {}, "source": [ "## 3. Add a New Row in a Dataframe\n", "- To add a new row in a dataframe, create an appropriate dataframe and then use `df.append()` method, which will return a new dataframe with the row added.\n", "```\n", "df.append(other, ignore_index=False)\n", "```\n", "\n", "**More on append in next session**" ] }, { "cell_type": "code", "execution_count": null, "id": "0359876f", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "df = pd.read_csv('datasets/groupdata.csv')\n", "df.tail()" ] }, { "cell_type": "code", "execution_count": null, "id": "c546f0e7", "metadata": {}, "outputs": [], "source": [ "# Let us create a new dataframe having a single row\n", "newdf = pd.DataFrame(data=[['MS222', 100, 'Kamokey', 'AFT', 'group D', 'Male', 55.0, 55.0, 9999]],\n", " columns=['roll no', 'age', 'address', 'session', 'group', 'gender','subj1', 'subj2', 'scholarship'])\n", "newdf" ] }, { "cell_type": "code", "execution_count": null, "id": "0ace24b9", "metadata": {}, "outputs": [], "source": [ "# Let us create a new dataframe having a single row (Can always create one having multiple rows). \n", "# Do note that we have not mentioned the scholarship data value as well as the scholarship column name\n", "\n", "newdf = pd.DataFrame(data=[['New Student', 'MS222', 100, 'Kamokey', 'AFT', 'group D', 'Male', 55.0, 55.0]],\n", " columns=['name', 'roll no', 'age', 'address', 'session', 'group', 'gender','subj1', 'subj2'])\n", "\n", "\n", "newdf" ] }, { "cell_type": "markdown", "id": "ceec2cdb", "metadata": {}, "source": [ "Note: The index associated with the only row in above dataframe is 0. Moreover, the sequence of columns is not same (name is coming before rollno)" ] }, { "cell_type": "code", "execution_count": null, "id": "1d82d20b", "metadata": { "scrolled": true }, "outputs": [], "source": [ "df = df.append(newdf , ignore_index=True)\n", "df.tail()" ] }, { "cell_type": "markdown", "id": "289c6eb8", "metadata": {}, "source": [ "- Note that the due to `ignore_index=True` argument it has been assigned the next available index. Otherwise, the new row will also have row index 0.\n", "- Moreover, note the NaN value under the scholarship column against the new added row\n", "- One last thing, the `df.append()` method do not have `inplace` argument, so you always have to assign the resulting dataframe to the original `df`. \n", "- Please check this out as to why `df.drop()` has `inplace` argument, while **`df.append()` does not have `inplace` argument.**" ] }, { "cell_type": "code", "execution_count": null, "id": "50a72b42", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "6319a646", "metadata": {}, "source": [ "## 4. Delete an Existing Row(s) from a Dataframe\n", "- The `df.drop()` method is used to remove one or more rows (other than columns) and will return a Series or Dataframe object accordingly.\n", "\n", "```\n", "df.drop(index=[---], axis=0, inplace=False)\n", "```\n", "- If you want to drop more than one rows, pass the row indices to be deleted as a Python List to the `index` parameter and to the `axis` argument pass 0. (`axis` argument specifies the direction of operation, which is top to bottom while deleting rows)\n", "- By default it is not inplace. Most of Pandas methods that return a dataframe has an inplace paremeter with default value set to False. It means the operation will not effect the underlying change" ] }, { "cell_type": "code", "execution_count": null, "id": "152b4ab0", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "df = pd.read_csv('datasets/groupdata.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "d1a0af57", "metadata": {}, "outputs": [], "source": [ "df.drop(index=[2,4], axis=0, inplace = True)" ] }, { "cell_type": "code", "execution_count": null, "id": "6aeb6b03", "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "29f44020", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "902c85fb", "metadata": {}, "source": [ "## 5. Adding a New Column with Conditional Values" ] }, { "cell_type": "markdown", "id": "bc74fa4d", "metadata": {}, "source": [ "**Create a Simple Dataframe**" ] }, { "cell_type": "code", "execution_count": null, "id": "5efdd999", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "df = pd.read_csv('datasets/groupdata.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "3de5510a", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "9525846c", "metadata": {}, "source": [ "**Example:** Add a new column `total` that contains sum of marks in `subj1` and `subj2`. Then add a new column that should contain string `\"Good\"` if `total>150` other wise string `\"Bad\"`" ] }, { "cell_type": "code", "execution_count": null, "id": "adcc0cc4", "metadata": {}, "outputs": [], "source": [ "df['total'] = df.subj1 + df.subj2\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "2d0049d1", "metadata": {}, "outputs": [], "source": [ "list1 = ['Good' if i >=150 else 'Bad' for i in df.total]\n", "list1" ] }, { "cell_type": "code", "execution_count": null, "id": "8d203125", "metadata": {}, "outputs": [], "source": [ "df['grade'] = ['Good' if i >=150 else 'Bad' for i in df.total]\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "19fc7d10", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "2146aabb", "metadata": {}, "source": [ "## 6. Deleting Row(s) Based on Specific Condition" ] }, { "cell_type": "code", "execution_count": null, "id": "76782bc7", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('datasets/groupdata.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "9df75dac", "metadata": {}, "outputs": [], "source": [ "df[df['address'] == 'Lahore']" ] }, { "cell_type": "code", "execution_count": null, "id": "20782468", "metadata": {}, "outputs": [], "source": [ "df[df['address'] == 'Lahore'].index" ] }, { "cell_type": "code", "execution_count": null, "id": "4934dc49", "metadata": {}, "outputs": [], "source": [ "df.drop(index = df[df['address'] == 'Lahore'].index, axis = 0, inplace = True)\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "1734d460", "metadata": {}, "outputs": [], "source": [ "# Let us drop an entire row from the data frame, in which session is 'AFT'\n", "# Get the indices where session == 'AFT' using the .index function\n", "indices = df[df['session'] == 'AFT'].index\n", "indices\n" ] }, { "cell_type": "code", "execution_count": null, "id": "6d63cfca", "metadata": {}, "outputs": [], "source": [ "# Pass those indices to the drop method to delete those rows\n", "df.drop(index = indices, inplace = True)" ] }, { "cell_type": "code", "execution_count": null, "id": "854b805c", "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "code", "execution_count": null, "id": "b7dcfad6", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "3a057e47", "metadata": {}, "source": [ "## 7. Delete a Column Based on Specific Condition" ] }, { "cell_type": "code", "execution_count": null, "id": "b89a4aef", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('datasets/groupdata.csv')\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "5b09f1b6", "metadata": {}, "outputs": [], "source": [ "df.info()" ] }, { "cell_type": "markdown", "id": "ecc13174", "metadata": {}, "source": [ "**Example:** Let us drop the column(s) from above data frame, if it has >= than 2 NaN values" ] }, { "cell_type": "code", "execution_count": null, "id": "c7820ce8", "metadata": {}, "outputs": [], "source": [ "mylist_mask=df.apply(lambda col: col.isnull().sum() >= 1)\n", "mylist_mask" ] }, { "cell_type": "code", "execution_count": null, "id": "e86f53f8", "metadata": {}, "outputs": [], "source": [ "mylist_names=df.columns[mylist_mask]\n", "mylist_names" ] }, { "cell_type": "code", "execution_count": null, "id": "caf76b7c", "metadata": {}, "outputs": [], "source": [ "df.drop(columns=mylist_names, axis=1, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "003388c6", "metadata": {}, "outputs": [], "source": [ "df.info()" ] }, { "cell_type": "code", "execution_count": null, "id": "a437cbdb", "metadata": {}, "outputs": [], "source": [ "# It will delete the Scholarship column\n", "df.drop(columns=df.columns[df.apply(lambda col: col.isnull().sum() >= 2)], axis=1, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "9bcc3676", "metadata": {}, "outputs": [], "source": [ "# Verify\n", "df.info()" ] }, { "cell_type": "code", "execution_count": null, "id": "6e24bef0", "metadata": {}, "outputs": [], "source": [ "df.head(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "90baa7ba", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "81185d7f", "metadata": {}, "source": [ "## 8. Change Datatype of a Pandas Series\n", "- Use the `astype(dtype)` method to cast a pandas object to a specified dtype ``dtype``." ] }, { "cell_type": "markdown", "id": "bff3d3a0", "metadata": {}, "source": [ "### a. Changing Datatype from `int64` to `float64`" ] }, { "cell_type": "code", "execution_count": null, "id": "1e135f0c", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('datasets/groupdata.csv')\n", "df.dtypes" ] }, { "cell_type": "code", "execution_count": null, "id": "64cb4509", "metadata": {}, "outputs": [], "source": [ "#Suppose we want to change the datatype of `age` column to float64 dtype\n", "df['age'] = df.age.astype(float)\n", "df.dtypes" ] }, { "cell_type": "code", "execution_count": null, "id": "12abba3c", "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "521130df", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "6a460f30", "metadata": {}, "source": [ "### b. Changing Datatype from string to boolean" ] }, { "cell_type": "code", "execution_count": null, "id": "a8d82038", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('datasets/groupdata.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "0006ca1c", "metadata": {}, "outputs": [], "source": [ "df.gender.str.contains('Male')" ] }, { "cell_type": "code", "execution_count": null, "id": "fabfdfdd", "metadata": {}, "outputs": [], "source": [ "df.gender.str.contains('Male').astype(int)" ] }, { "cell_type": "code", "execution_count": null, "id": "da262bb7", "metadata": {}, "outputs": [], "source": [ "df['gender'] = df.gender.str.contains('Male').astype(int)" ] }, { "cell_type": "code", "execution_count": null, "id": "efac25be", "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "code", "execution_count": null, "id": "7832a5b3", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "e26ad476", "metadata": {}, "source": [ "## 9. Sorting dataframes using `df.sort_values()`\n", "\n", ">Pandas data frame has two useful functions. **`df.sort_values()`** to sort by values of one or more columns and **`df.sort_index()`** to sort by the index. Each of these functions come with numerous options, like sorting in specific order (ascending or descending), sorting in place, sorting with missing values, sorting by specific algorithm etc.\n", "- The `df.sort_values()` function sort by the values along either axis.\n", "```\n", "df.sort_values(by,axis=0,ascending=True,inplace=False,kind='quicksort',na_position='last',ignore_index=False)\n", "```\n", "Where,\n", "- `by`: str or list of str to sort\n", "- `axis`: If `axis` is 0 or 'index' then 'by' may contain index levels and/or column labels. If `axis` is 1 or 'columns' then 'by' may contain column levels and/or index labels.\n", "- `ascending`: if True then ascending and if False then descending\n", "- `inplace`: If True, perform operation in-place.\n", "- `kind`: {'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort'. This option is only applied when sorting on a single column or label.\n", "- `na_position`: If first then puts NaNs at the beginning. Default is last\n", "- `ignore_index`: If True, the resulting axis will be labeled 0, 1, …, n - 1. Default False" ] }, { "cell_type": "code", "execution_count": null, "id": "b26bc45a", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "df = pd.DataFrame({\n", " 'roll_no': [ 102, 101, 104, 103, 105],\n", " 'name' : ['Kamal', 'Saima', 'Jamal', 'Shaikh', 'Farzana'],\n", " 'gender' : ['M', 'F', 'M', 'M', 'F'],\n", " 'grade' : ['A', 'A', 'B', 'B', 'A'],\n", " 'marks' : [ 21, 23, 12, 14, 20],\n", " 'city' : ['Lahore', 'Peshawer', 'Lahore', 'Karachi', 'Peshawer']\n", "})\n", "df" ] }, { "cell_type": "markdown", "id": "0ca23f18", "metadata": {}, "source": [ "### a. Sorting by Single Column" ] }, { "cell_type": "code", "execution_count": null, "id": "8c2ceb82", "metadata": {}, "outputs": [], "source": [ "# Let us sort the data by grade column\n", "# By default the sorting is done in ascending order and is not inplace\n", "df1 = df.sort_values(by=['grade'])\n", "df1" ] }, { "cell_type": "markdown", "id": "aadaceac", "metadata": {}, "source": [ "- Note in above output, we have sorted the data based on the `grade` column. You can observe that some of the students with higher marks are ranked lower.\n", "- We want to sort the data based on both grades and marks." ] }, { "cell_type": "markdown", "id": "dca3a055", "metadata": {}, "source": [ "### b. Sorting by Multiple Columns" ] }, { "cell_type": "code", "execution_count": null, "id": "586130e6", "metadata": {}, "outputs": [], "source": [ "# sort the dataframe\n", "df2 = df.sort_values(by=['grade','marks'])\n", "df2" ] }, { "cell_type": "markdown", "id": "3e4d24bd", "metadata": {}, "source": [ "- Note that the data is first sorted by grade, and then within grade it is sorted by marks\n", "- But still the problem is not solved. Actually we want to sort by grade in ascending order and by marks in descending order.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "a26aa1c0", "metadata": {}, "outputs": [], "source": [ "df3 = df.sort_values(by=['grade','marks'], ascending=[True,False])\n", "df3" ] }, { "cell_type": "markdown", "id": "9d7bfca2", "metadata": {}, "source": [ "### c. Reset the Index (if you want)\n", "- After you sort your dataset, you can observe that the index is also shuffled according to the sorting. If we want to reset the index we use `reset_index()` function.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "a4cf9a60", "metadata": {}, "outputs": [], "source": [ "df3.reset_index()" ] }, { "cell_type": "markdown", "id": "fa157337", "metadata": {}, "source": [ "- Observe that now it has created another column 'index' which is the previous index. \n", "- If you want to remove this just pass the parameter `drop = True` and also `inplace = True` to save the state." ] }, { "cell_type": "code", "execution_count": null, "id": "92122811", "metadata": {}, "outputs": [], "source": [ "df3.reset_index(inplace=True, drop=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "be4b8bfd", "metadata": {}, "outputs": [], "source": [ "df3" ] }, { "cell_type": "markdown", "id": "f3702626", "metadata": {}, "source": [ "### d. Role of NaN Values in Sorting" ] }, { "cell_type": "code", "execution_count": null, "id": "e2d87fc2", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('datasets/groupdata.csv')\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "5d098165", "metadata": {}, "outputs": [], "source": [ "# If there is a missing value NaN, by default it is listed at the end when using sort_values function\n", "# Regardless of the sorting order (Ascending or Descending)\n", "df.sort_values(by='scholarship')" ] }, { "cell_type": "code", "execution_count": null, "id": "1261bcc6", "metadata": {}, "outputs": [], "source": [ "# If the argument na_position='first', it will be listed at the top.\n", "df.sort_values(by=['scholarship'], na_position='first')" ] }, { "cell_type": "markdown", "id": "28f6d00a", "metadata": {}, "source": [ ">- Checkout the `df.nlargest()` method which return the first `n` rows ordered by `columns` in descending order.\n", ">- Checkout the `df.nsmallest()` method which return the first `n` rows ordered by `columns` in ascending order." ] }, { "cell_type": "code", "execution_count": null, "id": "286843f4", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "0aa9e97b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "19fa2853", "metadata": {}, "source": [ "## 10. Sorting dataframes using `df.sort_index()`\n", "> We have observed while using `df.sort_values()`, by default the sorting is performed in the vertical direction. If you want to sort in the row direction, we can set the`axis` argument of `df.sort_values()` method to 1, which is by default set to zero. However, it may cause problems when a number and a string are mixed\n", "\n", "- So to sort a dataframe in the horizontal direction, we normally use **`df.sort_index()`** method.\n", "```\n", "df.sort_index(axis=0,ascending=True,inplace=False,kind='quicksort',na_position='last',ignore_index=False)\n", "```\n", "Where,\n", "- `axis`: The axis along which to sort. The value 0 identifies the rows, and 1 identifies the columns. (default is 0)\n", "- `ascending`: If True then ascending and If False then descending\n", "- `inplace`: If True, perform operation in-place.\n", "- `kind`: {'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort'. This option is only applied when sorting on a single column or label.\n", "- `na_position`: If first then puts NaNs at the beginning. Default is last\n", "- `ignore_index`: If True, the resulting axis will be labeled 0, 1, …, n - 1. Default False" ] }, { "cell_type": "code", "execution_count": null, "id": "1925f8e3", "metadata": {}, "outputs": [], "source": [ "# Let us create a simple data frame\n", "import pandas as pd\n", "df = pd.DataFrame({\n", " 'roll_no': [ 102, 101, 104, 105, 103],\n", " 'name' : ['Kamal', 'Saima', 'Jamal','Farzana', 'Shaikh'],\n", " 'gender' : ['M', 'F', 'M', 'M', 'F'],\n", " 'grade' : ['A', 'A', 'B', 'B', 'A'],\n", " 'marks' : [ 21, 23, 12, 14, 20],\n", " 'city' : ['Lahore', 'Peshawer', 'Lahore', 'Karachi', 'Peshawer']\n", "})\n", "df" ] }, { "cell_type": "markdown", "id": "b4560468", "metadata": {}, "source": [ "### a. Sort by Column Labels\n", "- By passing the axis argument with a value 0 or 1, the sorting can be done on the column labels. By default, axis=0" ] }, { "cell_type": "code", "execution_count": null, "id": "7f4f7e44", "metadata": {}, "outputs": [], "source": [ "df1 = df.sort_index(axis=1)\n", "df1" ] }, { "cell_type": "markdown", "id": "eb402229", "metadata": {}, "source": [ "### b. Sort by Index\n", "- The first question that might come in your mind, is that why do we need to sort by index. We can see in above dataframe the row indices are in numeric order. And if the row indices are disturbed due to may be sorting by the values of some column, we can use `reset_index()` method to make the row indices again in increasing numeric order.\n", "

\n", "- To understand this, let us follow following three steps:\n", " - Set the `roll_no` column as index\n", " - Call `sort_index()` with axis=0\n", " - Call `reset_index()`" ] }, { "cell_type": "code", "execution_count": null, "id": "cf2a28ba", "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "id": "3fe78d5f", "metadata": {}, "source": [ "**Let us sort by roll_no**" ] }, { "cell_type": "code", "execution_count": null, "id": "a631c972", "metadata": {}, "outputs": [], "source": [ "# Lets us set the roll_no column as index\n", "df1 = df.set_index([\"roll_no\"])\n", "df1" ] }, { "cell_type": "markdown", "id": "7a463f5e", "metadata": {}, "source": [ ">Note that the implicit index collumn is dropped and the roll_no column has become the index of this dataframe." ] }, { "cell_type": "code", "execution_count": null, "id": "3249becc", "metadata": {}, "outputs": [], "source": [ "# sort the datframe by index \n", "df2 = df1.sort_index(axis=0)\n", "df2" ] }, { "cell_type": "code", "execution_count": null, "id": "5963639d", "metadata": {}, "outputs": [], "source": [ "# After sort you can reset the index if you want\n", "df3 = df2.reset_index(drop=False)\n", "df3" ] }, { "cell_type": "code", "execution_count": null, "id": "b94ae184", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }