<img src="./images/shouke_logo.png"
     style="float: right; padding-bottom: 100px;"
     width="100"/>
<br>
<br>

<table style="float:center;">
    <tr>
        <td>
            <img src='./images/python-logo.png' width=120>
        </td>
        <td>
            <img src='./images/pandas-logo.png' width=150>
        </td>
        <td>
            <img src='./images/scikit_learn_logo.png' width=150>
        </td>
    </tr>
</table>

<h1 style='text-align: center;'>Normalizing the Data</h1>
<h3 style='text-align: center;'>Shouke Wei, Ph.D. Professor</h3>
<h4 style='text-align: center;'>Email: shouke.wei@gmail.com</h4>

## Objective 
- Learn how to normalize the features, and how to save and reload the fitted scaler so that it can be applied to new data

In [5]:
# import required packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# read data
df = pd.read_csv('./data/gdp_china_encoded.csv')

# show the first 5 rows
df.head()

Unnamed: 0,year,gdp,pop,finv,trade,fexpen,uinc,prov_hn,prov_js,prov_sd,prov_zj
0,2000,1.074125,8.65,0.314513,1.408147,0.108032,0.976157,0.0,0.0,0.0,0.0
1,2001,1.203925,8.733,0.348443,1.501391,0.132133,1.041519,0.0,0.0,0.0,0.0
2,2002,1.350242,8.842,0.385078,1.830169,0.152108,1.11372,0.0,0.0,0.0,0.0
3,2003,1.584464,8.963,0.48132,2.346735,0.169563,1.238043,0.0,0.0,0.0,0.0
4,2004,1.886462,9.052298,0.587002,2.955899,0.185295,1.362765,0.0,0.0,0.0,0.0


### Slice data into features X and target y

In [6]:
X = df.drop(['gdp'],axis=1)
y = df['gdp']

### Split train and test data

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.30, random_state=1)

In [8]:
X_train

Unnamed: 0,year,pop,finv,trade,fexpen,uinc,prov_hn,prov_js,prov_sd,prov_zj
66,2009,5.276,1.074232,1.282390,0.265335,2.461081,0.0,0.0,0.0,1.0
54,2016,9.947,5.332294,1.547657,0.875521,3.401208,0.0,0.0,1.0,0.0
36,2017,7.656,5.327700,3.999750,1.062103,4.362180,0.0,1.0,0.0,0.0
45,2007,9.367,1.253770,0.931296,0.226185,1.426470,0.0,0.0,1.0,0.0
52,2014,9.789,4.249555,1.701122,0.717731,2.922194,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...
75,2018,5.155,3.169770,2.851160,0.862953,5.557430,0.0,0.0,0.0,1.0
9,2009,10.130,1.293312,4.174383,0.433437,2.157472,0.0,0.0,0.0,0.0
72,2015,5.539,2.732332,2.159908,0.664598,4.371448,0.0,0.0,0.0,1.0
12,2012,10.594,1.875150,6.211629,0.738786,3.022671,0.0,0.0,0.0,0.0


## 1. Normalization and Standardization

The terms *standardize* and *normalize* are often used interchangeably in data preprocessing, although in statistics the term *normalize* also has other connotations.

Normalization transforms the data so that it falls within a smaller or common range, such as [−1, 1] or [0, 1].

## 2. Why data normalization?

Normalization:

- gives all attributes an equal weight
- avoids dependence on the measurement units
- is particularly useful when training machine learning models
- helps speed up the learning phase

Normalization is not required for a linear regression model, but it can still help.

## 3. Methods for data normalization
#### Min-max normalization
Rescaling to the range [0, 1]:

$$x'=\frac{x - \min(x)}{\max(x) - \min(x)}$$

Rescaling to an arbitrary range $[new\_min, new\_max]$:

$$x'=\frac{x - \min(x)}{\max(x) - \min(x)}(new\_max - new\_min) + new\_min$$
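As a rough illustration of the min-max formulas, the sketch below rescales a small made-up list of values to [0, 1] and then to [-1, 1]. The helper name `min_max_scale` and the sample values are assumptions introduced only for this example; it is written in plain Python so that each term of the formula is visible.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # the formula divides by (max - min), so a constant feature is undefined here
        raise ValueError("min-max scaling is undefined for a constant feature")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


x = [2.0, 4.0, 6.0, 10.0]                  # made-up sample values
scaled_01 = min_max_scale(x)               # default target range [0, 1]
scaled_pm1 = min_max_scale(x, -1.0, 1.0)   # target range [-1, 1]

print(scaled_01)    # [0.0, 0.25, 0.5, 1.0]
print(scaled_pm1)   # [-1.0, -0.5, 0.0, 1.0]
```

The relative spacing of the values is preserved in both cases; only the end points of the range change.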

#### Mean normalization

$$x'=\frac{x - \text{mean}(x)}{\max(x) - \min(x)}$$

#### Z-score normalization / Standardization

$$x'=\frac{x - \mu}{\sigma}$$

where $\mu$ is the mean of the variable and $\sigma$ is its standard deviation.
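A minimal sketch of the z-score calculation, using a made-up sample and the standard library only. It assumes the population standard deviation (dividing by n), which is also the convention `StandardScaler` follows; using the sample standard deviation instead would give slightly different values.

```python
import statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up sample values

mu = statistics.fmean(x)       # mean of the variable (5 for this sample)
sigma = statistics.pstdev(x)   # population standard deviation (2 for this sample)

# subtract the mean and divide by the standard deviation
z = [(v - mu) / sigma for v in x]
print(z)
```

After the transformation the values have mean 0 and standard deviation 1, and each value states how many standard deviations it lies from the mean.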

#### Scaling to unit length

$$x'=\frac{x}{\lVert x \rVert}$$

where $\lVert x \rVert$ is the Euclidean length of the feature vector $x$.
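The unit-length formula can be sketched for a single made-up vector as follows. The vector `[3.0, 4.0]` is an assumption chosen because its Euclidean length is exactly 5.

```python
import math

x = [3.0, 4.0]                              # made-up feature vector

norm = math.sqrt(sum(v * v for v in x))     # Euclidean (L2) length of x
unit = [v / norm for v in x]                # same direction, length 1

print(norm)
print(unit)
```

The rescaled vector keeps its direction, and its own Euclidean length is 1.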

#### Decimal scaling 
$$x'=\frac{x}{10^j}$$

where $j$ is the smallest integer such that $\max(|x'|)<1$.
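Decimal scaling can be sketched as below. The helper name `decimal_scale` and the sample values are assumptions for illustration, and the search for $j$ starts at 0, so this sketch never uses a negative exponent even when all values are already much smaller than 1.

```python
def decimal_scale(values):
    """Divide by 10**j, where j is the smallest non-negative integer
    that makes every scaled absolute value smaller than 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


scaled, j = decimal_scale([-986.0, 217.0, 45.0])   # made-up sample values
print(j)
print(scaled)
```

For this sample the largest absolute value is 986, so the values are divided by $10^3$.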

## 4. Sklearn built-in methods for data normalization

### (1) MinMaxScaler
- Transform features by scaling each feature to a given range

### (2) MaxAbsScaler 
- Scales each feature to the range [-1, 1] by dividing it by its maximum absolute value

### (3) RobustScaler
- Scales features using statistics that are robust to outliers: it subtracts the column median and divides by the interquartile range.

### (4) StandardScaler
- StandardScaler scales each column to have 0 mean and unit variance.

### (5) Normalizer
- Normalizes each sample individually to unit norm. Unlike the scalers above, `Normalizer` operates on the rows rather than the columns. It applies L2 normalization by default.

Reference: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing
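
To compare the five transformers side by side, the sketch below applies each of them with default settings to a small made-up array whose two columns are on very different scales. The array `X` is an assumption for illustration and is unrelated to the GDP dataset.

```python
import numpy as np
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler, Normalizer,
                                   RobustScaler, StandardScaler)

# made-up data: two features on very different scales
X = np.array([[1.0, -200.0],
              [2.0,  100.0],
              [3.0,  400.0],
              [4.0,  800.0]])

for scaler in (MinMaxScaler(), MaxAbsScaler(), RobustScaler(),
               StandardScaler(), Normalizer()):
    X_scaled = scaler.fit_transform(X)   # fit and transform in one step
    print(type(scaler).__name__)
    print(X_scaled.round(3))
```

The first four transformers work column by column, so each feature is rescaled independently. `Normalizer` works row by row, so the second column dominates every row here.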

###  `MinMaxScaler` Example:

#### (1) Normalize the training dataset

In [16]:
# slice the continuous features from the training data
X_train_continuous = X_train.loc[:,'year':'uinc']
X_train_continuous

Unnamed: 0,year,pop,finv,trade,fexpen,uinc
66,2009,5.276,1.074232,1.282390,0.265335,2.461081
54,2016,9.947,5.332294,1.547657,0.875521,3.401208
36,2017,7.656,5.327700,3.999750,1.062103,4.362180
45,2007,9.367,1.253770,0.931296,0.226185,1.426470
52,2014,9.789,4.249555,1.701122,0.717731,2.922194
...,...,...,...,...,...,...
75,2018,5.155,3.169770,2.851160,0.862953,5.557430
9,2009,10.130,1.293312,4.174383,0.433437,2.157472
72,2015,5.539,2.732332,2.159908,0.664598,4.371448
12,2012,10.594,1.875150,6.211629,0.738786,3.022671


In [18]:
# learn the scaler parameters (per-feature min and max) from the training dataset
min_max_scaler = MinMaxScaler().fit(X_train_continuous)

# transform the training dataset to the range [0, 1]
X_train_continuous_scaled = min_max_scaler.transform(X_train_continuous)

# convert the result back into a DataFrame with the original index and columns
X_train_continuous_scaled = pd.DataFrame(X_train_continuous_scaled,
                                         index=X_train_continuous.index,
                                         columns=X_train_continuous.columns)
X_train_continuous_scaled

Unnamed: 0,year,pop,finv,trade,fexpen,uinc
66,0.500000,0.094319,0.173947,0.176927,0.145251,0.390579
54,0.888889,0.833518,0.964883,0.214073,0.544119,0.575614
36,0.944444,0.470961,0.964029,0.557440,0.666084,0.764752
45,0.388889,0.741731,0.207296,0.127763,0.119660,0.186948
52,0.777778,0.808514,0.763764,0.235562,0.440974,0.481335
...,...,...,...,...,...,...
75,1.000000,0.075170,0.563194,0.396602,0.535903,1.000000
9,0.500000,0.862478,0.214641,0.581894,0.255137,0.330823
72,0.833333,0.135939,0.481940,0.299806,0.406242,0.766576
12,0.666667,0.935908,0.322718,0.867170,0.454738,0.501111


In [19]:
# display the full scaled training dataset
X_train_scaled = X_train.copy()
X_train_scaled.loc[:,'year':'uinc'] = X_train_continuous_scaled
X_train_scaled

Unnamed: 0,year,pop,finv,trade,fexpen,uinc,prov_hn,prov_js,prov_sd,prov_zj
66,0.500000,0.094319,0.173947,0.176927,0.145251,0.390579,0.0,0.0,0.0,1.0
54,0.888889,0.833518,0.964883,0.214073,0.544119,0.575614,0.0,0.0,1.0,0.0
36,0.944444,0.470961,0.964029,0.557440,0.666084,0.764752,0.0,1.0,0.0,0.0
45,0.388889,0.741731,0.207296,0.127763,0.119660,0.186948,0.0,0.0,1.0,0.0
52,0.777778,0.808514,0.763764,0.235562,0.440974,0.481335,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...
75,1.000000,0.075170,0.563194,0.396602,0.535903,1.000000,0.0,0.0,0.0,1.0
9,0.500000,0.862478,0.214641,0.581894,0.255137,0.330823,0.0,0.0,0.0,0.0
72,0.833333,0.135939,0.481940,0.299806,0.406242,0.766576,0.0,0.0,0.0,1.0
12,0.666667,0.935908,0.322718,0.867170,0.454738,0.501111,0.0,0.0,0.0,0.0


#### (2) Normalize the testing dataset

In [20]:
# slice the continuous features from the testing data
X_test_continuous = X_test.loc[:,'year':'uinc']

# transform the testing dataset using the scaler fitted on the training data
X_test_continuous_scaled = min_max_scaler.transform(X_test_continuous)

# convert the result back into a DataFrame with the original index and columns
X_test_continuous_scaled = pd.DataFrame(X_test_continuous_scaled,
                                        index=X_test_continuous.index,
                                        columns=X_test_continuous.columns)
X_test_continuous_scaled

Unnamed: 0,year,pop,finv,trade,fexpen,uinc
40,0.111111,0.696629,0.039111,0.036688,0.028066,0.056056
31,0.666667,0.512739,0.547526,0.481719,0.431193,0.490291
46,0.444444,0.749644,0.261131,0.151409,0.148605,0.227113
58,0.055556,0.007754,0.027068,0.035371,0.010851,0.112156
77,0.055556,0.771483,0.003089,0.000578,0.005052,0.009864
49,0.611111,0.78446,0.471284,0.210696,0.298783,0.354778
87,0.611111,0.745055,0.304467,0.026858,0.249544,0.2643
44,0.333333,0.732553,0.180803,0.10364,0.091655,0.146158
88,0.666667,0.747903,0.372843,0.043088,0.299066,0.308541
90,0.777778,0.752651,0.546188,0.053241,0.365891,0.372103
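
Because the scaler reuses the minimum and maximum learned from the training data, transformed test values are not guaranteed to stay inside [0, 1]. The sketch below shows this with made-up one-column arrays; the names `X_train_demo` and `X_new_demo` are assumptions for illustration. It also uses the `clip` option of `MinMaxScaler`, which is available in recent scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train_demo = np.array([[10.0], [20.0], [30.0]])   # made-up training values
X_new_demo = np.array([[5.0], [25.0], [40.0]])      # made-up unseen values

# the scaler learns min=10 and max=30 from the training values only
scaler = MinMaxScaler().fit(X_train_demo)
new_scaled = scaler.transform(X_new_demo).ravel()
print(new_scaled)       # 5 maps below 0 and 40 maps above 1

# clip=True truncates transformed values to the target range
clipped = MinMaxScaler(clip=True).fit(X_train_demo).transform(X_new_demo).ravel()
print(clipped)
```

With these numbers the unseen values map to about -0.25, 0.75 and 1.5 without clipping. Whether to clip depends on the model; the example only shows that both behaviours are possible.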


In [21]:
# display the full scaled testing dataset
X_test_scaled = X_test.copy()
X_test_scaled.loc[:,'year':'uinc'] = X_test_continuous_scaled
X_test_scaled

Unnamed: 0,year,pop,finv,trade,fexpen,uinc,prov_hn,prov_js,prov_sd,prov_zj
40,0.111111,0.696629,0.039111,0.036688,0.028066,0.056056,0.0,0.0,1.0,0.0
31,0.666667,0.512739,0.547526,0.481719,0.431193,0.490291,0.0,1.0,0.0,0.0
46,0.444444,0.749644,0.261131,0.151409,0.148605,0.227113,0.0,0.0,1.0,0.0
58,0.055556,0.007754,0.027068,0.035371,0.010851,0.112156,0.0,0.0,0.0,1.0
77,0.055556,0.771483,0.003089,0.000578,0.005052,0.009864,1.0,0.0,0.0,0.0
49,0.611111,0.78446,0.471284,0.210696,0.298783,0.354778,0.0,0.0,1.0,0.0
87,0.611111,0.745055,0.304467,0.026858,0.249544,0.2643,1.0,0.0,0.0,0.0
44,0.333333,0.732553,0.180803,0.10364,0.091655,0.146158,0.0,0.0,1.0,0.0
88,0.666667,0.747903,0.372843,0.043088,0.299066,0.308541,1.0,0.0,0.0,0.0
90,0.777778,0.752651,0.546188,0.053241,0.365891,0.372103,1.0,0.0,0.0,0.0


## 5. Save and load the training scaler

In [22]:
import joblib

# save the fitted scaler to a file so that it can be reused later
joblib.dump(min_max_scaler,'mm_scaler')

['mm_scaler']

In [25]:
import joblib

# load the saved scaler, for example in a new session
mm_scaler = joblib.load('mm_scaler')
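
One way to check that a reloaded scaler behaves like the original is a small round trip. The sketch below uses made-up data and writes to a temporary directory; the file name `mm_scaler.joblib` and the variable names are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]])   # made-up data
scaler = MinMaxScaler().fit(X_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "mm_scaler.joblib")   # hypothetical file name
    joblib.dump(scaler, path)                          # save the fitted scaler
    restored = joblib.load(path)                       # load it back

# the restored scaler should produce the same output as the original
same = np.allclose(scaler.transform(X_demo), restored.transform(X_demo))
print(same)
```

The saved file stores the fitted parameters, so the restored scaler does not need to be fitted again before calling `transform`.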

In [26]:
# apply the loaded scaler to the testing data; the result is a NumPy array
X_test_continuous_scaled2 = mm_scaler.transform(X_test_continuous)
X_test_continuous_scaled2

array([[1.11111111e-01, 6.96629213e-01, 3.91109924e-02, 3.66880369e-02,
        2.80658336e-02, 5.60560888e-02],
       [6.66666667e-01, 5.12739357e-01, 5.47525660e-01, 4.81719398e-01,
        4.31192786e-01, 4.90290710e-01],
       [4.44444444e-01, 7.49643931e-01, 2.61131077e-01, 1.51408782e-01,
        1.48605435e-01, 2.27112677e-01],
       [5.55555556e-02, 7.75439152e-03, 2.70675105e-02, 3.53714159e-02,
        1.08511200e-02, 1.12155675e-01],
       [5.55555556e-02, 7.71482830e-01, 3.08939634e-03, 5.78018498e-04,
        5.05165395e-03, 9.86379321e-03],
       [6.11111111e-01, 7.84459566e-01, 4.71284143e-01, 2.10695512e-01,
        2.98782975e-01, 3.54778102e-01],
       [6.11111111e-01, 7.45054597e-01, 3.04466957e-01, 2.68583659e-02,
        2.49544384e-01, 2.64299509e-01],
       [3.33333333e-01, 7.32552619e-01, 1.80803243e-01, 1.03640173e-01,
        9.16553580e-02, 1.46157577e-01],
       [6.66666667e-01, 7.47903149e-01, 3.72842512e-01, 4.30876725e-02,
        2.99066019e-01, 