{ "cells": [ { "cell_type": "markdown", "id": "4a87b5ef", "metadata": {}, "source": [ "--- \n", " \n", "\n", "

Department of Data Science

\n", "

Course: Tools and Techniques for Data Science

\n", "\n", "---\n", "

Instructor: Muhammad Arif Butt, Ph.D.

" ] }, { "cell_type": "markdown", "id": "ab0dc25c", "metadata": {}, "source": [ "

Lecture 3.14 (Pandas-06)

" ] }, { "cell_type": "markdown", "id": "19f82705", "metadata": {}, "source": [ "\n", "\n", "## _Modifying Dataframes Part-I_" ] }, { "cell_type": "code", "execution_count": null, "id": "29077002", "metadata": {}, "outputs": [], "source": [ "# To install this library in Jupyter notebook\n", "#import sys\n", "#!{sys.executable} -m pip install pandas" ] }, { "cell_type": "code", "execution_count": null, "id": "9c0b2727", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "pd.__version__ , pd.__path__" ] }, { "cell_type": "markdown", "id": "12db95e1", "metadata": {}, "source": [ "## Learning agenda of this notebook\n", "1. Modifying Column labels of Dataframe\n", "2. Modifying Row indices of Dataframe\n", "3. Modifying Row(s) Data (Records) of a Dataframe\n", " - Modifying a single Row\n", " - Modifying multiple Rows\n", " - `map()` Method\n", " - `df.replace()` Method\n", " - `df.apply()` Method\n", " - `df.applymap()` Method" ] }, { "cell_type": "code", "execution_count": null, "id": "329760ba", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "73464b5a", "metadata": {}, "source": [ "## Read a Sample Dataframe" ] }, { "cell_type": "code", "execution_count": null, "id": "6914759c", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "df = pd.read_csv('datasets/groupdata.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "f48af20d", "metadata": {}, "outputs": [], "source": [ "# The `shape` attribute of a dataframe object returns a two-value tuple containing the number of rows and columns\n", "# Note that the row count does not include the column labels, and the column count does not include the row index\n", "df.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "8c00cc7d", "metadata": {}, "outputs": [], "source": [ "# The `index` attribute of a dataframe object returns the row indices and their 
datatype\n", "df.index" ] }, { "cell_type": "code", "execution_count": null, "id": "b8a7fc1b", "metadata": {}, "outputs": [], "source": [ "# The `columns` attribute of a dataframe object returns the column labels and their datatype\n", "df.columns" ] }, { "cell_type": "code", "execution_count": null, "id": "899d087f", "metadata": {}, "outputs": [], "source": [ "# The `dtypes` attribute of a dataframe object returns the data type of each column in the dataframe\n", "df.dtypes" ] }, { "cell_type": "code", "execution_count": null, "id": "846b966a", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "5cc6dae4", "metadata": {}, "source": [ "## 1. Modifying Column Names of a Dataframe\n", "- Every dataframe has column labels associated with its columns\n", "- By default these are the integer values 0, 1, 2, 3, ...\n", "- However, while creating a dataframe from scratch, or while reading it from a file, you can set them to more meaningful string values.\n", "- While reading from a CSV file, the first row in the file is taken as the column labels by default\n", "- We can change the column labels if we want\n", "- Let us see this in practice for better understanding" ] }, { "cell_type": "code", "execution_count": null, "id": "81a7d7d3", "metadata": {}, "outputs": [], "source": [ "! cat datasets/groupdatawithoutcollables.csv" ] }, { "cell_type": "code", "execution_count": null, "id": "82bfadbb", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "6a8c8924", "metadata": {}, "source": [ "### a. 
While Reading a Dataset in a Dataframe\n", "- Pass a List of column names to `names` argument of `pd.read_csv()` method" ] }, { "cell_type": "code", "execution_count": null, "id": "62b72b08", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "df = pd.read_csv('datasets/groupdatawithoutcollables.csv', names = ['roll no', 'name', 'age', 'address', 'session', \n", " 'group', 'gender','subj1', 'subj2', 'scholarship'])\n", "\n", "df.head(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "a7de4650", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "b77fcfe7", "metadata": {}, "source": [ "### b. After Dataframe is Loaded (Use `columns` attribute of dataframe)" ] }, { "cell_type": "code", "execution_count": null, "id": "8523eb8d", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('datasets/groupdatawithoutcollables.csv', header = None)\n", "df.head(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "b1f7ec9a", "metadata": {}, "outputs": [], "source": [ "df.columns = ['roll no', 'name', 'age', 'address', 'session', 'group', 'gender', 'subj1', 'subj2', 'scholarship']\n", "df.head(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "375db8e0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "245f0d9f", "metadata": {}, "source": [ ">- Suppose we have a dataframe in which there are certain column labels having spaces in between the names.\n", ">- We want to rename all such columns by replacing the space character with an underscore\n", ">- One way to do this is call `replace()` method of String class on all the column names of dataframe" ] }, { "cell_type": "code", "execution_count": null, "id": "3b4503bf", "metadata": {}, "outputs": [], "source": [ "df.columns" ] }, { "cell_type": "code", "execution_count": null, "id": "607b9dd0", "metadata": {}, "outputs": [], "source": [ "df.columns.str.replace(' ', '_')" ] }, { "cell_type": "code", 
"execution_count": null, "id": "8fbbf61f", "metadata": {}, "outputs": [], "source": [ "df.columns = df.columns.str.replace(' ', '_')" ] }, { "cell_type": "code", "execution_count": null, "id": "7e2929b0", "metadata": {}, "outputs": [], "source": [ "df.columns" ] }, { "cell_type": "code", "execution_count": null, "id": "6b2644f5", "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "78233da4", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "c7fc9bde", "metadata": {}, "source": [ ">- Suppose we have a dataframe in which there are column labels having names in different cases.\n", ">- We want to rename all such columns such that the names are all lower or all upper case.\n", ">- One way to do this is to generate a new list as per the requirement using List comprehension." ] }, { "cell_type": "code", "execution_count": null, "id": "fcd1431a", "metadata": {}, "outputs": [], "source": [ "list1 = [x.upper() for x in df.columns]\n", "list1" ] }, { "cell_type": "code", "execution_count": null, "id": "92b2e1a3", "metadata": {}, "outputs": [], "source": [ "df.columns = list1\n", "df.head(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "647464ad", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "d5e388c9", "metadata": {}, "source": [ "### c. 
After Dataframe is Loaded (Use `df.rename()` method)\n", "- What if your dataframe has many columns with appropriate names, and you want to change only one or two of them rather than all?\n", "- Use the `df.rename()` method to change one or more column names\n", "```\n", "df.rename(mapper, axis=None, inplace=False)\n", "```\n", "- Where,\n", " - `mapper`: can be a dictionary of comma separated key:value pairs, where the key is the old column name and the value is the new column name\n", " - `axis`: To change the column names, use axis = 1 (the column axis, which runs from left to right)\n", " - `inplace`: Set this argument to True to modify the dataframe in place, in which case the method returns None" ] }, { "cell_type": "code", "execution_count": null, "id": "01c9b321", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('datasets/groupdata.csv')\n", "df.head(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "b8e0a405", "metadata": {}, "outputs": [], "source": [ "# Since the inplace argument is False by default, the rename() method returns a new dataframe and leaves `df` unchanged\n", "df.rename(mapper={'roll no': 'rollno', 'name':'fname'}, axis=1)" ] }, { "cell_type": "code", "execution_count": null, "id": "9e64955e", "metadata": {}, "outputs": [], "source": [ "df.columns" ] }, { "cell_type": "code", "execution_count": null, "id": "68b1215a", "metadata": {}, "outputs": [], "source": [ "# Since the inplace argument is now set to True, the rename() method returns None\n", "# however, the `df` itself is changed\n", "df.rename(mapper={ 'roll no': 'rollno'}, axis=1, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "d517045f", "metadata": {}, "outputs": [], "source": [ "df.columns" ] }, { "cell_type": "code", "execution_count": null, "id": "27dd7509", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "ca6f31e3", "metadata": {}, "source": [ "## 2. 
Modifying Row Indices of a Dataframe\n", "- Every row of a dataframe has an associated row index; by default these are the integer values 0, 1, 2, 3, ...\n", "- After you have filtered a dataframe on a condition or sorted it, the rows keep their original index labels, so the row indices are no longer sequential.\n", "- In our previous session we saw in detail the two methods `df.set_index()` and `df.reset_index()` that handle this issue." ] }, { "cell_type": "code", "execution_count": null, "id": "2cf5b5d7", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "0d66a0b7", "metadata": {}, "source": [ "## 3. Modifying Data of a Single Row/Record of a Dataframe" ] }, { "cell_type": "code", "execution_count": null, "id": "8b374f27", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('datasets/groupdata.csv')\n", "df.head(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "85715c96", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "c5991fdc", "metadata": {}, "source": [ "### a. Select the row/record you want to modify\n", "Let us suppose we want to change the `subj1` and `subj2` marks of Shaista" ] }, { "cell_type": "code", "execution_count": null, "id": "4437b197", "metadata": {}, "outputs": [], "source": [ "# Returns a Series object\n", "df.loc[2,:]" ] }, { "cell_type": "code", "execution_count": null, "id": "f80d1c9e", "metadata": {}, "outputs": [], "source": [ "# Returns a Dataframe object\n", "df.loc[df.name=='Shaista', :]" ] }, { "cell_type": "code", "execution_count": null, "id": "b1ffbdf9", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "0b457b3b", "metadata": {}, "source": [ "### b. 
Option 1:\n", "- One way is to pass a new list of values and assign it to the appropriate series (row)" ] }, { "cell_type": "code", "execution_count": null, "id": "8ffa376a", "metadata": {}, "outputs": [], "source": [ "# Any of the following two LOC will work\n", "df.loc[2,:] = ['MS03', 'Shaista', 35, 'Karachi', 'AFTERNOON', 'group B', 'Female', 99, 99, 8500.0]\n", "df.loc[df.name=='Shaista', :] = ['MS03', 'Shaista', 35, 'Karachi', 'AFTERNOON', 'group B', 'Female', 99, 99, 8500.0]\n", "df.head(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "f76ef336", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "0c6574ca", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "da44dcc2", "metadata": {}, "source": [ "### c. Option 2:\n", "- A better way is to assign only those two values that we want to change instead of assigning the complete list of values in that row" ] }, { "cell_type": "code", "execution_count": null, "id": "578eb3ae", "metadata": {}, "outputs": [], "source": [ "# Returns a series\n", "df.loc[2, ['subj1', 'subj2']] " ] }, { "cell_type": "code", "execution_count": null, "id": "9d4fdeae", "metadata": {}, "outputs": [], "source": [ "# Returns a dataframe\n", "df.loc[df.name=='Shaista', ['subj1', 'subj2']]" ] }, { "cell_type": "code", "execution_count": null, "id": "a134245a", "metadata": {}, "outputs": [], "source": [ "df.loc[2, ['subj1', 'subj2']] = [100, 100]\n", "df.loc[df.name=='Shaista', ['subj1', 'subj2']] = [100, 100]\n", "df.head(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "dca30c2e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "1614e432", "metadata": {}, "source": [ "**Note: You can also use `df.iloc[]` method instead of `df.loc[]` to change multiple or single value of a row. 
Other than these two, you may also use the `df.at[]` accessor to change a single value of a row.**\n", "```\n", "df.loc[filter, 'column(s)'] = 'value(s)'\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "fc0ef9db", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "22806e44", "metadata": {}, "source": [ "## 4. Modifying Data of Multiple Rows\n", "- So far we have learnt to modify a single value, multiple values, or all the values of a single row in a dataframe.\n", "- What if we want to modify multiple rows at a time?\n", "- The following methods will come to your rescue:\n", " - `map()`\n", " - `df.replace()`\n", " - `df.apply()`\n", " - `df.applymap()`" ] }, { "cell_type": "code", "execution_count": null, "id": "6c67f6eb", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "27344570", "metadata": {}, "source": [ "### a. The Python Built-in `map()` Function\n", "- The ```map(aFunction, *iterables)``` function returns a map object (a lazy iterator) that applies `aFunction()` to each element of the `iterable(s)` as it is consumed. \n", "- Later you can type cast the map object to an appropriate data structure\n", "- The original iterable(s) remain unchanged. 
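
The three points above can be sketched in plain Python. This is a minimal illustration only: the `names` list and the numbers below are made-up sample values, not taken from the course dataset.

```python
# Hypothetical sample data, invented for this sketch
names = ['Arif', 'Rauf', 'Shaista']

# map() returns a lazy map object; nothing is computed until it is consumed
lengths = map(len, names)

# Type cast the map object to the data structure you need
length_list = list(lengths)

# With several iterables, the function receives one element from each
larger = list(map(max, [1, 5, 3], [4, 2, 3]))

print(length_list)   # [4, 4, 7]
print(larger)        # [4, 5, 3]
print(names)         # the original list is unchanged
```

A map object can be consumed only once, so calling `list(lengths)` a second time gives an empty list; build a fresh map object if you need the results again.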
" ] }, { "cell_type": "code", "execution_count": null, "id": "ae055249", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "df = pd.read_csv('datasets/groupdata.csv')\n", "df.head(3)" ] }, { "cell_type": "markdown", "id": "3558b5ca", "metadata": {}, "source": [ "**Example:** Using built-in function with `map()`" ] }, { "cell_type": "code", "execution_count": null, "id": "c47b33b2", "metadata": {}, "outputs": [], "source": [ "# Passing a Series object (a column of dataframe) to map() as argument\n", "# The Python built-in `len()` function is applied to all the values of name column and return a map object\n", "map(len, df['name'])" ] }, { "cell_type": "code", "execution_count": null, "id": "2b3baed2", "metadata": {}, "outputs": [], "source": [ "# Type cast the map object to Series\n", "pd.Series(map(len, df['name']))" ] }, { "cell_type": "code", "execution_count": null, "id": "9e1b2a3b", "metadata": {}, "outputs": [], "source": [ "# Another way is to call the map() method by a Series object using dot notation\n", "df['name'].map(len)" ] }, { "cell_type": "code", "execution_count": null, "id": "c7e98bd2", "metadata": {}, "outputs": [], "source": [ "# Third way is to access the column name as well using dot notation\n", "df.name.map(len)" ] }, { "cell_type": "code", "execution_count": null, "id": "11d99b7c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "72855f41", "metadata": {}, "source": [ "**Example:** Using a user-defined function with `map()`" ] }, { "cell_type": "code", "execution_count": null, "id": "c81ff087", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('datasets/groupdata.csv')\n", "df.head(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "95d31d15", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "b471c19e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "f8818e2f", 
"metadata": {}, "outputs": [], "source": [ "# Let us pass a user-defined function\n", "def myfunc(x):\n", " if (x <= 50):\n", " return \"Young\"\n", " else:\n", " return \"Old\"\n", "\n", "df['age'].map(myfunc)" ] }, { "cell_type": "code", "execution_count": null, "id": "7a67f5ed", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "0971eb8d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "3384d3a1", "metadata": {}, "outputs": [], "source": [ "# If you want to save this as a new column in the dataframe you can do that\n", "df['newcol'] = df['age'].map(myfunc)" ] }, { "cell_type": "code", "execution_count": null, "id": "72fb2d86", "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "c5b44991", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "110ddb24", "metadata": {}, "source": [ "**Example:** Using a Lambda function with `map()`" ] }, { "cell_type": "code", "execution_count": null, "id": "ee6094cc", "metadata": {}, "outputs": [], "source": [ "df['age'].map(lambda x: \"Young\" if x<=50 else \"Old\")" ] }, { "cell_type": "code", "execution_count": null, "id": "eec63155", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "f4d7f77d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "ac1507aa", "metadata": {}, "source": [ "**Example:** Using a Lambda Function with `map()`" ] }, { "cell_type": "code", "execution_count": null, "id": "21ae1009", "metadata": {}, "outputs": [], "source": [ "# You cannot pass upper to map() as we have passed len to map() \n", "# as upper() is not a built-in function rather is a method of string class\n", "#df['name'].map(upper)" ] }, { "cell_type": "code", "execution_count": null, "id": "b6e20b71", "metadata": {}, "outputs": [], "source": [ "df['name'].map(lambda x: 
x.upper())" ] }, { "cell_type": "code", "execution_count": null, "id": "f91462b6", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "f41e8162", "metadata": {}, "source": [ "**Example:** Passing a Dictionary {oldval:newval} to `map()` for changing selected values of a categorical column" ] }, { "cell_type": "code", "execution_count": null, "id": "cb9d4a53", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('datasets/groupdata.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "f16298b5", "metadata": {}, "outputs": [], "source": [ "df['session'].map({'MORNING':'M', 'AFTERNOON':'A'})" ] }, { "cell_type": "markdown", "id": "4e8b549e", "metadata": {}, "source": [ ">**Limitations of `map()` Method**\n", ">- When a dictionary is passed, every value that has no matching key in the dictionary is replaced with NaN. The solution is to use the `df.replace()` method\n", ">- You can use it on an iterable or a Series object, but not on an entire dataframe. The solution is to use `df.apply()` or `df.applymap()`" ] }, { "cell_type": "markdown", "id": "64166c46", "metadata": {}, "source": [ "### b. 
The `df.replace()` Method\n", "- The `df.replace()` method is used to replace values given in `to_replace` with `value`\n", "- The matching values in the entire dataframe are replaced with new values dynamically.\n", "- This differs from updating with ``.loc`` or ``.iloc``, which require you to specify a location to update with some value.\n", "\n", "```\n", "df.replace(to_replace, value, inplace=False)\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "936ad59d", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('datasets/groupdata.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "0ddb01aa", "metadata": {}, "outputs": [], "source": [ "df['session'].replace({'MORNING':'M', 'AFTERNOON':'A'})" ] }, { "cell_type": "markdown", "id": "19f8d7fc", "metadata": {}, "source": [ ">- Note that now there are no NaN values; the values that do not have a match remain unchanged\n", ">- Another important point is that the `replace()` method works equally well on an entire dataframe" ] }, { "cell_type": "code", "execution_count": null, "id": "475e1674", "metadata": {}, "outputs": [], "source": [ "# Calling replace on entire dataframe\n", "df.replace({'MORNING':'M', 'AFTERNOON':'A', 'group A':'GROUP-A'})" ] }, { "cell_type": "code", "execution_count": null, "id": "c21ed5ea", "metadata": {}, "outputs": [], "source": [ "# Above operation is not inplace\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "10e3d73a", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "163c7161", "metadata": {}, "source": [ "### c. The `df.apply()` Method\n", "- The `df.apply()` method is used to run a function along the mentioned axis of the dataframe. 
 \n", "- In simple words, `df.apply()` passes each column (or each row) of the dataframe to a function as a Series, while calling `apply()` on a single Series runs the function on each of its elements\n", "\n", "```\n", "df.apply(func, axis=0, args=())\n", "```\n", "- Where,\n", " - `func`: It can be a built-in, user-defined or a lambda function that is applied to every series of the dataframe as per the axis argument. (Objects passed to the func are series objects)\n", " - `axis`: The default value of the axis argument is zero, so the func is applied to each column. If you want to apply the func to each row, set axis to one.\n", " - `args` : If you want to pass additional positional arguments to `func` after the series (or element), you can pass them as a tuple." ] }, { "cell_type": "code", "execution_count": null, "id": "8ba6146d", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "df = pd.read_csv('datasets/groupdata.csv')\n", "df.head(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "d04dd0c2", "metadata": {}, "outputs": [], "source": [ "# Let us pass the built-in function `len()` and compute the length of each name under the name column of df\n", "# So now the len() function is applied to all the values of a single column and a series object is returned\n", "df['name'].apply(len)" ] }, { "cell_type": "code", "execution_count": null, "id": "7a02b8a3", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "f59a2df2", "metadata": {}, "outputs": [], "source": [ "# Let us pass a user-defined function, with an additional argument as well. 
This was not possible with map() method\n", "def myfunc(x, age):\n", "    if (x <= age):\n", "        return \"Young\"\n", "    else:\n", "        return \"Old\"\n", "\n", "df['age'].apply(myfunc, args = (50,))" ] }, { "cell_type": "code", "execution_count": null, "id": "0afa2e82", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "34028461", "metadata": {}, "outputs": [], "source": [ "# Let us use Lambda function to convert each name under the name column of df to upper case\n", "df['name'].apply(lambda x : x.upper())" ] }, { "cell_type": "code", "execution_count": null, "id": "a9a15c48", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "b0dc26bb", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "d95ca771", "metadata": {}, "outputs": [], "source": [ "# If you are satisfied with the result, you may assign it to the specific column\n", "df['name'] = df['name'].apply(lambda x : x.upper())" ] }, { "cell_type": "code", "execution_count": null, "id": "60d03926", "metadata": {}, "outputs": [], "source": [ "# Verify\n", "df.head(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "bbb43aa2", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "05d408f2", "metadata": {}, "outputs": [], "source": [ "# Can anyone guess what this LOC will do?\n", "df['subj1'] = df['subj1'].apply(lambda x : x+5)" ] }, { "cell_type": "code", "execution_count": null, "id": "2da58c62", "metadata": {}, "outputs": [], "source": [ "df.head(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "c526a02a", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": 
"58285888", "metadata": {}, "source": [ ">So far we have applied the `apply()` method on a specific column of a dataframe. Let us apply it on a row of a dataframe" ] }, { "cell_type": "code", "execution_count": null, "id": "ab9198ee", "metadata": {}, "outputs": [], "source": [ "# Since each row contains values of different dtypes, let us create a dataframe having numeric columns only\n", "df = pd.read_csv('datasets/groupdata.csv')\n", "df_numeric = df.loc[:,['age','subj1','subj2','scholarship']]\n", "df_string = df.loc[:,['roll no','name','address','session', 'group', 'gender']]" ] }, { "cell_type": "code", "execution_count": null, "id": "7e2f22c0", "metadata": {}, "outputs": [], "source": [ "df_numeric.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "b2d4e516", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "2935b45d", "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Although not very meaningful, let us add a number to each value of the row\n", "df_numeric.loc[0].apply(lambda x : x+5)" ] }, { "cell_type": "code", "execution_count": null, "id": "5abc507b", "metadata": {}, "outputs": [], "source": [ "# If you want to commit this to the dataframe you can do that " ] }, { "cell_type": "code", "execution_count": null, "id": "13a840c5", "metadata": {}, "outputs": [], "source": [ "df_numeric.loc[0] = df_numeric.loc[0].apply(lambda x : x+5)" ] }, { "cell_type": "code", "execution_count": null, "id": "172cebcc", "metadata": {}, "outputs": [], "source": [ "df_numeric.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "776c187d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "bcb214af", "metadata": {}, "source": [ ">Let us use the `df.apply()` method on an entire dataframe" ] }, { "cell_type": "code", "execution_count": null, "id": "509f699e", "metadata": {}, "outputs": [], "source": [ "df_numeric.apply(lambda x: x+5).head()" ] }, { 
"cell_type": "code", "execution_count": null, "id": "11c4707c", "metadata": {}, "outputs": [], "source": [ "df.apply(min)" ] }, { "cell_type": "code", "execution_count": null, "id": "d28b2bf8", "metadata": {}, "outputs": [], "source": [ "min(df['subj1'])" ] }, { "cell_type": "markdown", "id": "614994f1", "metadata": {}, "source": [ "The `min()` function has been applied to each column of the dataframe, the minimum value of each column has been computed, and the `df.apply()` method has returned the results as a Series object" ] }, { "cell_type": "code", "execution_count": null, "id": "0648a5e9", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "9eb7892d", "metadata": {}, "source": [ "### d. The `df.applymap()` Method\n", "- The `df.applymap()` method applies a function to a dataframe element-wise.\n", "\n", "```\n", "df.applymap(func)\n", "```\n", "- Where,\n", " - `func`: A function that is passed a single value and returns a single value.\n", " \n", "Note: A Series object does not have an `applymap()` method, so you cannot call it on a Series object. In pandas 2.1 and later, `df.applymap()` is deprecated in favour of `df.map()`, which does the same thing." ] }, { "cell_type": "code", "execution_count": null, "id": "cb6d7604", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('datasets/groupdata.csv')\n", "df_string = df.loc[:,['roll no','name','address','session', 'group', 'gender']]\n", "df_numeric = df.loc[:,['age','subj1','subj2','scholarship']]" ] }, { "cell_type": "code", "execution_count": null, "id": "40f21845", "metadata": {}, "outputs": [], "source": [ "df_string.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "16f8d4d7", "metadata": {}, "outputs": [], "source": [ "df_numeric.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "efc3de7a", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "2b2943aa", "metadata": {}, "outputs": [], "source": [ "df_string.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "e7af3ae0", "metadata": {}, 
"outputs": [], "source": [ "df_string.applymap(str.upper).head()" ] }, { "cell_type": "code", "execution_count": null, "id": "74f80ea6", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "b4c83144", "metadata": {}, "outputs": [], "source": [ "df_numeric.head(5)" ] }, { "cell_type": "code", "execution_count": null, "id": "6e14f875", "metadata": {}, "outputs": [], "source": [ "# The applymap() method will apply the lambda function (add 5) to each element of the dataframe\n", "df_numeric.applymap(lambda x : x+5).head(5)" ] }, { "cell_type": "code", "execution_count": null, "id": "b94ae184", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }