<img src="./images/shouke_logo.png"
 style="float: right; padding-bottom:100px;"
 width="100" />
<br>
<br>

<table style="float:center;">
 <tr>
 <td>
 <img src='./images/python-logo.png' width=130>
 </td>
 <td>
 <img src='./images/pandas-logo.png' width=150>
 </td>
 </tr>
</table>

<h1 style='text-align: center;'>Imputing Missing Values in a DataFrame</h1>
<h3 style='text-align: center;'>Shouke Wei, Ph.D. Professor</h3>
<h4 style='text-align: center;'>Email: shouke.wei@gmail.com</h4>

## Objective
- Learn how to drop or fill (impute) the missing values in a pandas DataFrame

In [18]:
# import required packages
import pandas as pd

# read data 
df = pd.read_csv('./data/gdp_china_renamed.csv')

# display names of the columns
df.columns

Index(['prov', 'gdpr', 'year', 'gdp', 'pop', 'finv', 'trade', 'fexpen',
 'uinc'],
 dtype='object')

## 1. Drop missing values

In [19]:
# drop every row that contains at least one missing value
df_new = df.dropna()

In [21]:
df_new.isna().sum().sum()

0
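
By default `dropna()` removes a row as soon as any column is missing, which can discard more data than necessary. The sketch below shows how to count the gaps first and how to restrict the drop to one column with `subset`. The `toy` frame is a hypothetical stand-in for the GDP data; its `gdp` values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the GDP data (gdp values are made up)
toy = pd.DataFrame({
    "prov": ["A", "A", "A", "B"],
    "pop": [8.963, np.nan, 9.194, 7.458],
    "gdp": [1.0, 2.0, np.nan, 4.0],
})

# Count the missing values per column before deciding what to drop
missing_per_col = toy.isna().sum()

# Default behaviour: drop a row if any column is missing
dropped_any = toy.dropna()

# Drop a row only when 'pop' itself is missing
dropped_pop = toy.dropna(subset=["pop"])

print(missing_per_col)
print(len(toy), len(dropped_any), len(dropped_pop))
```

Here `dropped_any` loses two of the four rows, while `dropped_pop` loses only the row where `pop` is missing.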

## 2. Fill missing values

In [23]:
# rows around the two missing 'pop' values, used to compare the methods below
idx_l = [3, 4, 5, 22, 23, 24]

In [24]:
df.loc[idx_l,['pop']]

| | pop |
|---|---|
| 3 | 8.963 |
| 4 | NaN |
| 5 | 9.194 |
| 22 | 7.458 |
| 23 | NaN |
| 24 | 7.588 |


### (1) fill missing values with the mean value(s)

In [26]:
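# work on a copy without the 'prov' and 'gdpr' columns
df_num = df.drop(['prov', 'gdpr'], axis=1)
# fill the missing values of each column with that column's mean
df_new = df_num.fillna(df_num.mean())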

In [27]:
df_new.loc[idx_l,['pop']]

| | pop |
|---|---|
| 3 | 8.963 |
| 4 | 8.321032 |
| 5 | 9.194 |
| 22 | 7.458 |
| 23 | 8.321032 |
| 24 | 7.588 |
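
In the output above, rows 4 and 23 receive the same column mean even though their neighbouring values differ (about 9 versus about 7.5). As a minimal sketch, the code below shows two variations: `mean(numeric_only=True)` fills the numeric columns without first dropping the text columns, and a group-wise mean fills each gap from its own group. It assumes that `prov` identifies the group a row belongs to; the `toy` frame and its `A`/`B` labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the 'pop' values mirror the rows shown above,
# the 'prov' labels A and B are assumptions made for illustration
toy = pd.DataFrame({
    "prov": ["A", "A", "A", "B", "B", "B"],
    "pop": [8.963, np.nan, 9.194, 7.458, np.nan, 7.588],
})

# Means of the numeric columns only; 'prov' is skipped rather than dropped
filled_global = toy.fillna(toy.mean(numeric_only=True))

# Alternative: fill each gap with the mean of its own 'prov' group
group_mean = toy.groupby("prov")["pop"].transform("mean")
filled_group = toy.assign(pop=toy["pop"].fillna(group_mean))

print(filled_global)
print(filled_group)
```

Whether a global or a group-wise mean is more appropriate depends on the data; the group-wise version is shown only as one option.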


### (2) fill with built-in methods

In [29]:
# forward fill: carry the last valid value forward
# (df.fillna(method='ffill') is deprecated in recent pandas versions)
df_new = df.ffill()

In [30]:
df_new.loc[idx_l,['pop']]

| | pop |
|---|---|
| 3 | 8.963 |
| 4 | 8.963 |
| 5 | 9.194 |
| 22 | 7.458 |
| 23 | 7.458 |
| 24 | 7.588 |


In [31]:
# backward fill: use the next valid value
# (df.fillna(method='bfill') is deprecated in recent pandas versions)
df_new = df.bfill()

In [32]:
df_new.loc[idx_l,['pop']]

| | pop |
|---|---|
| 3 | 8.963 |
| 4 | 9.194 |
| 5 | 9.194 |
| 22 | 7.458 |
| 23 | 7.588 |
| 24 | 7.588 |


Refer to: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html
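
Forward and backward fill each have an edge case: a forward fill cannot fill a gap at the very start of a column, and a backward fill cannot fill one at the very end. The following sketch, on a hypothetical Series, shows that behaviour, one way to chain the two fills, and the `limit` parameter, which caps how many consecutive gaps are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading gap and two consecutive gaps
s = pd.Series([np.nan, 8.963, np.nan, np.nan, 9.194])

# Forward fill: the first value stays NaN because nothing precedes it
forward = s.ffill()

# Chain a backward fill to close the gap left at the start
both = s.ffill().bfill()

# Fill at most one consecutive missing value per gap
limited = s.ffill(limit=1)

print(forward.tolist())
print(both.tolist())
print(limited.tolist())
```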

### (3) fill with interpolation

In [33]:
# linear interpolation (the default method)
df_new = df.interpolate()  # same as method='linear'

In [34]:
df_new.loc[idx_l,['pop']]

| | pop |
|---|---|
| 3 | 8.963 |
| 4 | 9.0785 |
| 5 | 9.194 |
| 22 | 7.458 |
| 23 | 7.523 |
| 24 | 7.588 |
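
With the default linear method, a single missing value between two known neighbours is simply their midpoint, which is where the 9.0785 for row 4 comes from: (8.963 + 9.194) / 2. The sketch below checks this on a three-value Series taken from the rows above and adds a hypothetical second Series to show that longer gaps are filled in equal steps.

```python
import numpy as np
import pandas as pd

# Rows 3-5 of 'pop' from the output above
pop = pd.Series([8.963, np.nan, 9.194])

# Linear interpolation (the default method)
linear = pop.interpolate()

# A single gap is filled with the midpoint of its two neighbours
midpoint = (8.963 + 9.194) / 2

# Hypothetical series with two consecutive gaps: filled in equal steps
steps = pd.Series([1.0, np.nan, np.nan, 4.0]).interpolate()

print(linear.iloc[1], midpoint)
print(steps.tolist())
```

Linear interpolation needs only pandas and NumPy, whereas the `polynomial` and `cubicspline` methods rely on SciPy being installed.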


In [35]:
# polynomial interpolation of order 2
df_new = df.interpolate(method='polynomial', order=2)

In [36]:
df_new.loc[idx_l,['pop']]

| | pop |
|---|---|
| 3 | 8.963 |
| 4 | 9.054257 |
| 5 | 9.194 |
| 22 | 7.458 |
| 23 | 7.522919 |
| 24 | 7.588 |


In [37]:
# cubic spline interpolation (`order` applies only to 'polynomial' and 'spline')
df_new = df.interpolate(method='cubicspline')

In [38]:
df_new.loc[idx_l,['pop']]

| | pop |
|---|---|
| 3 | 8.963 |
| 4 | 9.052298 |
| 5 | 9.194 |
| 22 | 7.458 |
| 23 | 7.532276 |
| 24 | 7.588 |


Refer to:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html

https://www.w3schools.com/python/pandas/ref_df_interpolate.asp

## 4. Save data as a CSV file

In [40]:
df_new.to_csv('./data/gdp_china_mis_cl.csv',index=False)
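
Passing `index=False` keeps the row index out of the file; otherwise it is written as an extra unnamed column that shows up as `Unnamed: 0` when the file is read back. A quick round trip can confirm that the saved file has the expected columns and no remaining gaps. The sketch below uses a hypothetical `clean` frame and a temporary directory rather than the `./data` folder.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned frame standing in for df_new
clean = pd.DataFrame({
    "prov": ["A", "A", "A"],
    "pop": [8.963, 9.0785, 9.194],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_clean.csv")  # hypothetical file name

    # Write without the index, then read the file back
    clean.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The columns survive unchanged and no missing values remain
print(list(reloaded.columns))
print(reloaded.isna().sum().sum())
```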