---   
 <img align="left" width="75" height="75"  src="https://upload.wikimedia.org/wikipedia/en/c/c8/University_of_the_Punjab_logo.png"> 

<h1 align="center">Department of Data Science</h1>
<h1 align="center">Course: Tools and Techniques for Data Science</h1>

---
<h3><div align="right">Instructor: Muhammad Arif Butt, Ph.D.</div></h3>    

<h1 align="center">Lecture 3.14 (Pandas-06)</h1>

<a href="https://colab.research.google.com/github/arifpucit/data-science/blob/master/Section-3-Python-for-Data-Scientists/Lec-3.14(Pandas-06-Modifying-Dataframes-I).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img align="right" width="400" height="400"  src="images/pandas-apps.png"  >

## _Modifying Dataframes Part-I_

In [None]:
# To install this library in Jupyter notebook
#import sys
#!{sys.executable} -m pip install pandas

In [None]:
import pandas as pd
pd.__version__ , pd.__path__

## Learning agenda of this notebook
1. Modifying Column labels of Dataframe
2. Modifying Row indices of Dataframe
3. Modifying Row(s) Data (Records) of a Dataframe
   - Modifying a single Row
   - Modifying multiple Rows
       - `map()` Method
       - `df.replace()` Method
       - `df.apply()` Method
       - `df.applymap()` Method

##  Read a Sample Dataframe

In [None]:
import pandas as pd
df = pd.read_csv('datasets/groupdata.csv')
df.head()

In [None]:
# The `shape` attribute of a dataframe object returns a two-value tuple: (number of rows, number of columns)
# Note that the row count does not include the column labels and the column count does not include the row index
df.shape

In [None]:
# The `index` attribute of a dataframe object returns an Index object holding the row indices and their datatype
df.index

In [None]:
# The `columns` attribute of a dataframe object returns an Index object holding the column labels and their datatype
df.columns

In [None]:
# The `dtypes` attribute of a dataframe object returns the data type of each column in the dataframe
df.dtypes

## 1. Modifying Column Names of a Dataframe
- Every dataframe has column labels associated with its columns.
- By default these are integer values 0, 1, 2, 3, ...
- However, while creating a dataframe from scratch or reading one from a file, you can set them to more meaningful string values.
- While reading from a CSV file, the first row of the file is taken as the column labels by default.
- We can change the column labels whenever we want.
- Let us see this in practice for better understanding.

In [None]:
! cat datasets/groupdatawithoutcollables.csv

### a. While Reading a Dataset in a Dataframe
- Pass a list of column names to the `names` argument of the `pd.read_csv()` function

In [None]:
import pandas as pd
df = pd.read_csv('datasets/groupdatawithoutcollables.csv', names = ['roll no', 'name', 'age', 'address', 'session', 
                                                                'group', 'gender','subj1', 'subj2', 'scholarship'])

df.head(3)

### b. After Dataframe is Loaded (Use `columns` attribute of dataframe)

In [None]:
df = pd.read_csv('datasets/groupdatawithoutcollables.csv', header = None)
df.head(3)

In [None]:
df.columns = ['roll no', 'name', 'age', 'address', 'session', 'group', 'gender', 'subj1', 'subj2', 'scholarship']
df.head(3)
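
The two approaches above can be tried without the course data file. The following is a minimal sketch that assumes a tiny, made-up three-column CSV held in a string (the values and the variable names `csv_text`, `df_named` and `df_late` are hypothetical), and it suggests that both routes end up with the same labels.

```python
import io
import pandas as pd

# Hypothetical CSV text with no header row (not the course dataset)
csv_text = "MS01,Ali,25\nMS02,Sara,31\n"

# Option a: supply the labels while reading, through the `names` argument
df_named = pd.read_csv(io.StringIO(csv_text), names=['roll no', 'name', 'age'])

# Option b: read with header=None to get default integer labels, then overwrite df.columns
df_late = pd.read_csv(io.StringIO(csv_text), header=None)
print(list(df_late.columns))   # [0, 1, 2]
df_late.columns = ['roll no', 'name', 'age']

print(list(df_named.columns))
print(list(df_late.columns))
```

Note that assigning to `df.columns` requires a list whose length equals the number of columns.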

>- Suppose we have a dataframe in which some column labels contain spaces.
>- We want to rename all such columns by replacing the space character with an underscore.
>- One way to do this is to call the `str.replace()` method on the column labels of the dataframe, which applies the replacement to every label.

In [None]:
df.columns

In [None]:
df.columns.str.replace(' ', '_')

In [None]:
df.columns = df.columns.str.replace(' ', '_')

In [None]:
df.columns

In [None]:
df.head()
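
Since every `.str` call on the column labels returns a new Index, several clean-up steps can be chained. The following sketch assumes a small, made-up dataframe (`df_demo` and its untidy labels are hypothetical) and strips surrounding spaces, lower-cases the labels and then replaces the inner spaces.

```python
import pandas as pd

# Hypothetical dataframe with untidy column labels
df_demo = pd.DataFrame({' Roll No': ['MS01'], 'Full Name ': ['Ali'], 'AGE': [25]})

# Each .str method returns a new Index, so the calls can be chained
df_demo.columns = (df_demo.columns
                   .str.strip()            # drop leading/trailing spaces
                   .str.lower()            # make every label lower case
                   .str.replace(' ', '_')) # replace inner spaces with underscores

print(list(df_demo.columns))   # ['roll_no', 'full_name', 'age']
```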

>- Suppose we have a dataframe in which there are column labels having names in different cases.
>- We want to rename all such columns such that the names are all lower or all upper case.
>- One way to do this is to generate a new list as per the requirement using List comprehension.

In [None]:
list1 = [x.upper() for x in df.columns]
list1

In [None]:
df.columns = list1
df.head(3)

### c. After Dataframe is Loaded (Use `df.rename()` method)
- What if your dataframe has many columns with appropriate names, and you want to change only one or two of them?
- Use the `df.rename()` method to change one or more column names.
```
df.rename(mapper, axis=None, inplace=False)
```
- Where,
    - `mapper`: a dictionary of key:value pairs, where each key is an old column name and its value is the new column name
    - `axis`: to change the column names use axis = 1 (the column axis, which runs from left to right); with axis = 0 the mapper is applied to the row index instead
    - `inplace`: set this argument to True to make the change in place, in which case the method returns None

In [None]:
df = pd.read_csv('datasets/groupdata.csv')
df.head(3)

In [None]:
# Since the inplace argument is False by default, the rename() method returns a new dataframe
df.rename(mapper={'roll no': 'rollno', 'name':'fname'}, axis=1)

In [None]:
df.columns

In [None]:
# Since the inplace argument is now set to True, the rename() method returns None;
# however, `df` itself is changed
df.rename(mapper={ 'roll no': 'rollno'}, axis=1, inplace=True)

In [None]:
df.columns

## 2. Modifying Row Indices of a Dataframe
- Every dataframe has a row index associated with every row; by default these are integer values 0, 1, 2, 3, ...
- After you have filtered a dataframe on a condition or sorted it, the rows keep their original index labels, so the indices are no longer consecutive or in order.
- In our previous session we saw in detail the two methods, `df.set_index()` and `df.reset_index()`, that handle this issue.
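
As a quick reminder, here is a minimal sketch on a small, made-up dataframe (`df_demo` is hypothetical) showing how sorting leaves the index out of order and how `reset_index(drop=True)` restores a fresh 0, 1, 2, ... index.

```python
import pandas as pd

# Hypothetical dataframe, not the course dataset
df_demo = pd.DataFrame({'name': ['Ali', 'Sara', 'Omar'], 'age': [31, 25, 40]})

# Sorting moves the rows, but each row keeps its original index label
sorted_df = df_demo.sort_values('age')
print(list(sorted_df.index))   # [1, 0, 2]

# drop=True discards the old index instead of keeping it as a new column
tidy = sorted_df.reset_index(drop=True)
print(list(tidy.index))        # [0, 1, 2]
```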

## 3. Modifying Data of a Single Row/Record of a Dataframe

In [None]:
df = pd.read_csv('datasets/groupdata.csv')
df.head(3)

### a.  Select the row/record you want to modify
Let us suppose we want to change the `subj1` and `subj2` marks of Shaista.

In [None]:
# Returns a Series object
df.loc[2,:]

In [None]:
# Returns a Dataframe object
df.loc[df.name=='Shaista', :]

### b.  Option 1:
- One way is to pass a new list of values and assign it to the appropriate series (row)

In [None]:
# Either of the following two lines of code will work
df.loc[2,:] = ['MS03', 'Shaista', 35, 'Karachi', 'AFTERNOON', 'group B', 'Female', 99, 99, 8500.0]
df.loc[df.name=='Shaista', :] = ['MS03', 'Shaista', 35, 'Karachi', 'AFTERNOON', 'group B', 'Female', 99, 99, 8500.0]
df.head(3)

### c.  Option 2:
- A better way is to assign only those two values that we want to change instead of assigning the complete list of values in that row

In [None]:
# Returns a series
df.loc[2, ['subj1', 'subj2']] 

In [None]:
# Returns a dataframe
df.loc[df.name=='Shaista', ['subj1', 'subj2']]

In [None]:
df.loc[2, ['subj1', 'subj2']] = [100, 100]
df.loc[df.name=='Shaista', ['subj1', 'subj2']] = [100, 100]
df.head(3)

**Note: You can also use the `df.iloc[]` indexer instead of `df.loc[]` to change a single value or multiple values of a row. Besides these two, you may also use the `df.at[]` indexer to change a single value of a row.**
```
df.loc[filter, 'column(s)'] = 'value(s)'
```
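
The note above can be sketched as follows, assuming a small, made-up dataframe (`df_demo` and its marks are hypothetical): `df.at[]` takes one row label and one column label, while `df.iloc[]` takes integer positions.

```python
import pandas as pd

# Hypothetical dataframe, not the course dataset
df_demo = pd.DataFrame({'name': ['Ali', 'Sara', 'Omar'],
                        'subj1': [60, 70, 80],
                        'subj2': [65, 75, 85]})

# df.at[]: label-based access to exactly one cell (row label 1, column 'subj1')
df_demo.at[1, 'subj1'] = 99

# df.iloc[]: position-based access (row at position 2, columns at positions 1 and 2)
df_demo.iloc[2, [1, 2]] = [100, 100]

print(df_demo)
```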

## 4. Modifying Data of Multiple Rows
- Up till now we have learnt to modify a single value, multiple values, or all the values of a single row in a dataframe.
- What if we want to modify multiple rows at a time?
- The following functions and methods come to your rescue:
    - `map()`
    - `df.replace()`
    - `df.apply()`
    - `df.applymap()`

### a. The Python Built-in `map()` Function
- The ```map(aFunction, *iterables)``` function returns a map object that applies `aFunction()` to every element of the `iterable(s)`.
- You can later type cast the map object to an appropriate data structure.
- The original iterable(s) remain unchanged.
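
Before using `map()` with a dataframe column, the three points above can be sketched on plain Python lists (the lists `names` and the numbers used here are made up for illustration).

```python
# Hypothetical list of names
names = ['Ali', 'Shaista', 'Omar']

# map() returns a lazy map object; values are computed only when it is consumed
lengths = map(len, names)
print(list(lengths))   # [3, 7, 4]

# The original iterable remains unchanged
print(names)           # ['Ali', 'Shaista', 'Omar']

# With two iterables, the function receives one element from each at a time
totals = list(map(lambda a, b: a + b, [1, 2, 3], [10, 20, 30]))
print(totals)          # [11, 22, 33]
```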

In [None]:
import pandas as pd
df = pd.read_csv('datasets/groupdata.csv')
df.head(3)

**Example:** Using built-in function with `map()`

In [None]:
# Passing a Series object (a column of the dataframe) to map() as an argument
# The Python built-in `len()` function is applied to all the values of the name column and a map object is returned
map(len, df['name'])

In [None]:
# Type cast the map object to Series
pd.Series(map(len, df['name']))

In [None]:
# Another way is to call the map() method of the Series object using dot notation
df['name'].map(len)

In [None]:
# Third way is to access the column name as well using dot notation
df.name.map(len)

**Example:** Using a user-defined function with `map()`

In [None]:
df = pd.read_csv('datasets/groupdata.csv')
df.head(3)

In [None]:
# Let us pass a user-defined function
def myfunc(x):
    if (x <= 50):
        return "Young"
    else:
        return "Old"

df['age'].map(myfunc)

In [None]:
# If you want to save this as a new column in the dataframe you can do that
df['newcol'] = df['age'].map(myfunc)

In [None]:
df.head()

**Example:** Using a Lambda function with `map()`

In [None]:
df['age'].map(lambda x: "Young" if x<=50 else "Old")

**Example:** Applying a string method with `map()`

In [None]:
# You cannot pass the bare name `upper` to map() the way we passed `len`,
# because upper() is not a built-in function; it is a method of the str class.
# Either wrap it in a lambda (next cell) or pass the method as `str.upper`
#df['name'].map(upper)

In [None]:
df['name'].map(lambda x: x.upper())

**Example:** Passing a Dictionary {oldval:newval} to `map()` for changing selected values of a categorical column

In [None]:
df = pd.read_csv('datasets/groupdata.csv')
df.head()

In [None]:
df['session'].map({'MORNING':'M', 'AFTERNOON':'A'})

>**Limitations of `map()` Method**
>- Values that have no matching key in the dictionary are replaced with NaN. The solution is to use the `df.replace()` method.
>- You can use it on an iterable or a Series object, but not on an entire dataframe. The solution is to use `df.apply()` and `df.applymap()`.
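
The first limitation can be sketched on a small, made-up Series (the value `'EVENING'` and the names `session`, `codes`, `mapped` and `kept` are hypothetical). One possible workaround, if you want to stay with `map()`, is to fill the resulting NaN values from the original Series.

```python
import pandas as pd

# Hypothetical Series with one value that has no entry in the dictionary
session = pd.Series(['MORNING', 'AFTERNOON', 'EVENING'])
codes = {'MORNING': 'M', 'AFTERNOON': 'A'}

# 'EVENING' has no matching key, so it becomes NaN
mapped = session.map(codes)
print(mapped.isna().tolist())   # [False, False, True]

# Possible workaround: fall back to the original value where map() produced NaN
kept = session.map(codes).fillna(session)
print(kept.tolist())            # ['M', 'A', 'EVENING']
```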

### b. The `df.replace()` Method
- The `df.replace()` method replaces the values given in `to_replace` with `value`.
- Every matching value, wherever it occurs in the dataframe, is replaced with the new value.
- This differs from updating with ``.loc`` or ``.iloc``, which require you to specify the location to update.

```
df.replace(to_replace, value, inplace=False)
```
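
Besides a flat dictionary, `to_replace` and `value` can take a few other shapes. The following sketch assumes a small, made-up dataframe (`df_demo` is hypothetical) and shows a scalar pair, two lists paired by position, and a nested dictionary that restricts the replacement to one column.

```python
import pandas as pd

# Hypothetical dataframe, not the course dataset
df_demo = pd.DataFrame({'session': ['MORNING', 'AFTERNOON', 'MORNING'],
                        'group': ['group A', 'group B', 'group A']})

# Scalar to scalar: every 'MORNING' in the dataframe becomes 'M'
out1 = df_demo.replace('MORNING', 'M')

# List to list: values are paired by position
out2 = df_demo.replace(['MORNING', 'AFTERNOON'], ['M', 'A'])

# Nested dictionary {column: {old: new}}: replacement only inside the 'group' column
out3 = df_demo.replace({'group': {'group A': 'GROUP-A'}})

print(out1['session'].tolist())   # ['M', 'AFTERNOON', 'M']
print(out2['session'].tolist())   # ['M', 'A', 'M']
print(out3['group'].tolist())     # ['GROUP-A', 'group B', 'GROUP-A']
```

Since `inplace` is False by default, `df_demo` itself remains unchanged in all three calls.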

In [None]:
df = pd.read_csv('datasets/groupdata.csv')
df.head()

In [None]:
df['session'].replace({'MORNING':'M', 'AFTERNOON':'A'})

>- Note that now there are no NaN values; the values that have no match are left unchanged.
>- Another important point is that the `replace()` method works equally well on an entire dataframe.

In [None]:
# Calling replace on entire dataframe
df.replace({'MORNING':'M', 'AFTERNOON':'A', 'group A':'GROUP-A'})

In [None]:
# Above operation is not inplace
df

### c. The `df.apply()` Method
- The `df.apply()` method is used to run a function along the specified axis of the dataframe.
- In simple words, `df.apply()` passes each column (or row) of the dataframe to the function as a Series, while calling `apply()` on a single Series runs the function on each of its elements.

```
df.apply(func, axis=0, args=())
```
- Where,
    - `func`: It can be a built-in, user-defined or lambda function that is applied to every series of the dataframe as per the axis argument. (The objects passed to the func are Series objects.)
    - `axis`: The default value of the axis argument is zero, so the func is applied to each column. If you want to apply the func to each row, set axis to one.
    - `args`: If you want to pass additional arguments to `func` besides the series (or the element, when calling `apply()` on a Series), pass them as a tuple.
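
The three arguments can be sketched on a small, made-up numeric dataframe (`df_demo` and the helper functions `spread` and `add_bonus` are hypothetical names used only for this illustration).

```python
import pandas as pd

# Hypothetical dataframe of marks, not the course dataset
df_demo = pd.DataFrame({'subj1': [60, 70, 80], 'subj2': [65, 95, 85]})

def spread(s):
    # `s` is a whole Series: one column (axis=0) or one row (axis=1)
    return s.max() - s.min()

# axis=0 (default): the function receives each column
print(df_demo.apply(spread).tolist())           # [20, 30]

# axis=1: the function receives each row
print(df_demo.apply(spread, axis=1).tolist())   # [5, 25, 5]

def add_bonus(s, bonus):
    # `bonus` is the extra positional argument supplied through `args`
    return s + bonus

boosted = df_demo.apply(add_bonus, args=(5,))
print(boosted)
```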

In [None]:
import pandas as pd
df = pd.read_csv('datasets/groupdata.csv')
df.head(3)

In [None]:
# Let us pass the built-in function `len()` and compute the length of each name under the name column of df
# The len() function is applied to every value of a single column and a Series object is returned
df['name'].apply(len)

In [None]:
# Let us pass a user-defined function that takes an additional argument, supplied through `args` (the map() examples above used single-argument functions only)
def myfunc(x, age):
    if (x <= age):
        return "Young"
    else:
        return "Old"

df['age'].apply(myfunc, args = (50,))

In [None]:
# Let us use Lambda function to convert each name under the name column of df to upper case
df['name'].apply(lambda x : x.upper())

In [None]:
# If you are satisfied with the result, you may assign it to the specific column
df['name'] = df['name'].apply(lambda x : x.upper())

In [None]:
# Verify
df.head(3)

In [None]:
# Can anyone guess what this line of code will do?
df['subj1'] = df['subj1'].apply(lambda x : x+5)

In [None]:
df.head(3)

>Up till now we have applied the `apply()` method on a specific column of a dataframe. Let us apply it on a row of the dataframe.

In [None]:
# Since each row holds values of different dtypes, let us create a dataframe having numeric columns only
df = pd.read_csv('datasets/groupdata.csv')
df_numeric = df.loc[:,['age','subj1','subj2','scholarship']]
df_string = df.loc[:,['roll no','name','address','session', 'group', 'gender']]

In [None]:
df_numeric.head()

In [None]:
# Although not much meaningful, let us add a number to each value of the row
df_numeric.loc[0].apply(lambda x : x+5)

In [None]:
# If you want to commit this to the dataframe you can do that
df_numeric.loc[0] = df_numeric.loc[0].apply(lambda x : x+5)

In [None]:
df_numeric.head()

>Let us use the `df.apply()` method on entire dataframe

In [None]:
df_numeric.apply(lambda x: x+5).head()

In [None]:
df.apply(min)

In [None]:
min(df['subj1'])

The `min()` function has been applied to each column of the dataframe, computing the minimum value of every column, and the `df.apply()` method has returned the results as a Series object.
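
For common reductions such as the minimum, a dataframe also has its own methods, so `df.apply(min)` is not the only route. The following sketch assumes a small, made-up numeric dataframe (`df_demo` is hypothetical) and compares the two; on numeric columns without missing values, both should give the same column-wise minimums.

```python
import pandas as pd

# Hypothetical numeric dataframe, not the course dataset
df_demo = pd.DataFrame({'age': [25, 31, 40], 'subj1': [60, 70, 55]})

# Built-in min() applied to every column through apply()
via_apply = df_demo.apply(min)

# The dataframe's own min() method computes the same column-wise minimum
via_method = df_demo.min()

print(via_apply.tolist())    # [25, 55]
print(via_method.tolist())   # [25, 55]
```

The dedicated methods are usually the more idiomatic choice when one exists, while `apply()` remains useful for functions that have no such equivalent.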

### d. The `df.applymap()` Method
- The `df.applymap()` method applies a function to a dataframe element-wise.

```
df.applymap(func)
```
- Where,
    - `func`: A function that is passed a single value and returns a single value.

Note: A Series object does not have an `applymap()` method, so you cannot call it on a Series object.
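
As far as we are aware, recent pandas releases (2.1 onwards) deprecate `df.applymap()` in favor of an equivalent `df.map()` method on dataframes. The following sketch assumes a small, made-up dataframe (`df_demo` and the helper name `elementwise` are hypothetical) and picks whichever of the two methods is available in the installed version.

```python
import pandas as pd

# Hypothetical dataframe of strings, not the course dataset
df_demo = pd.DataFrame({'name': ['ali', 'sara'], 'city': ['lahore', 'karachi']})

# Use DataFrame.map where it exists, otherwise fall back to applymap
elementwise = df_demo.map if hasattr(pd.DataFrame, 'map') else df_demo.applymap

# str.upper receives one element at a time and returns one element
upper = elementwise(str.upper)
print(upper.values.tolist())   # [['ALI', 'LAHORE'], ['SARA', 'KARACHI']]
```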

In [None]:
df = pd.read_csv('datasets/groupdata.csv')
df_string = df.loc[:,['roll no','name','address','session', 'group', 'gender']]
df_numeric = df.loc[:,['age','subj1','subj2','scholarship']]

In [None]:
df_string.head()

In [None]:
df_numeric.head()

In [None]:
df_string.applymap(str.upper).head()

In [None]:
df_numeric.head(5)

In [None]:
# The applymap() method will apply the lambda function (which adds 5) to each element of the dataframe
df_numeric.applymap(lambda x : x+5).head(5)