# Scaling data with `scikit-learn`

Many machine learning techniques require, or at least work better on, standardized data. In this notebook, we discuss typical scaling schemes offered by `scikit-learn`'s `preprocessing` module.

In [1]:
import io
import pandas
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler

Let us assume we have the following data, read using the `pandas.read_csv` function:

In [2]:
file_content = io.StringIO("""age;weight;height
23;70;180
22;65;160
31;80;190
26;80;175
22;65;170
""")

df = pandas.read_csv(file_content, sep=";")
print(df)

   age  weight  height
0   23      70     180
1   22      65     160
2   31      80     190
3   26      80     175
4   22      65     170


The three variables have rather different means and variances, which can be problematic for some machine learning tools:

In [3]:
print("Means")
print(df.mean(axis=0))
print("\nStandard deviations")
print(df.std(axis=0))

Means
age        24.8
weight     72.0
height    175.0
dtype: float64

Standard deviations
age        3.834058
weight     7.582875
height    11.180340
dtype: float64


For such cases, `scikit-learn` offers scaler objects that rescale data on a per-variable basis. In this tutorial, we will present the following scalers:
* [`StandardScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) rescales each variable to have zero mean and unit variance (see the documentation if you want only unit-variance scaling or only zero-mean centering);
* [`MinMaxScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler) rescales each variable to lie in the [0,1] interval (see the documentation if you want to change the interval boundaries);
* [`MaxAbsScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler) divides each variable by its maximum absolute value, so that the result lies in the [-1,1] interval.
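
To make the three descriptions above concrete, the sketch below recomputes each scaling by hand with `numpy` and compares the result with the corresponding scaler. The array `X` simply repeats the toy data of this notebook; the `manual_*` names are illustrative and not part of `scikit-learn`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler

# Same toy data as in this notebook: columns are age, weight, height.
X = np.array([[23, 70, 180],
              [22, 65, 160],
              [31, 80, 190],
              [26, 80, 175],
              [22, 65, 170]], dtype=float)

# StandardScaler: subtract the column mean, then divide by the column
# standard deviation (population version, i.e. numpy's default ddof=0).
manual_standard = (X - X.mean(axis=0)) / X.std(axis=0)

# MinMaxScaler: send the column minimum to 0 and the column maximum to 1.
manual_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# MaxAbsScaler: divide by the largest absolute value found in each column.
manual_maxabs = X / np.abs(X).max(axis=0)

# Each comparison should report that the manual and library results agree.
print(np.allclose(manual_standard, StandardScaler().fit_transform(X)))
print(np.allclose(manual_minmax, MinMaxScaler().fit_transform(X)))
print(np.allclose(manual_maxabs, MaxAbsScaler().fit_transform(X)))
```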

In [4]:
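# fit() learns each column's mean and standard deviation;
# transform() then applies the scaling using those learned values.
scaler = StandardScaler()
scaler.fit(df)
df_scaled = scaler.transform(df)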

print("Scaled data")
print(df_scaled)
print("\nMeans")
print(df_scaled.mean(axis=0))
print("\nStandard deviations")
print(df_scaled.std(axis=0))

Scaled data
[[-0.52489066 -0.29488391  0.5       ]
 [-0.81649658 -1.03209369 -1.5       ]
 [ 1.80795671  1.17953565  1.5       ]
 [ 0.34992711  1.17953565  0.        ]
 [-0.81649658 -1.03209369 -0.5       ]]

Means
[ -1.99840144e-16 0.00000000e+00 0.00000000e+00]

Standard deviations
[ 1. 1. 1.]
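
One detail is easy to miss here: the standard deviations printed earlier with `df.std` (3.83, 7.58, 11.18) are not the ones `StandardScaler` divides by. `pandas` defaults to the sample standard deviation (`ddof=1`), whereas `StandardScaler` uses the population standard deviation (`ddof=0`). The short check below, which rebuilds the same toy data, illustrates the difference through the fitted `scale_` attribute; the variable names are only illustrative.

```python
import numpy as np
import pandas
from sklearn.preprocessing import StandardScaler

df = pandas.DataFrame({"age": [23, 22, 31, 26, 22],
                       "weight": [70, 65, 80, 80, 65],
                       "height": [180, 160, 190, 175, 170]})

sample_std = df.std(axis=0)              # pandas default: ddof=1
population_std = df.std(axis=0, ddof=0)  # divides by n instead of n - 1

scaler = StandardScaler().fit(df)

print("pandas default (ddof=1):", sample_std.to_numpy())
print("pandas with ddof=0:     ", population_std.to_numpy())
print("StandardScaler.scale_:  ", scaler.scale_)
```

This also explains why the scaled data has a standard deviation of exactly 1 when computed with `numpy` (whose `std` defaults to `ddof=0`), but would not when computed with the `pandas` default.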


Note that `pandas` dataframes are turned into `numpy` arrays after scaling (`scikit-learn` works with `numpy` arrays).
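
If the column names are needed afterwards, one simple option is to wrap the scaled array back into a dataframe. This is a minimal sketch that rebuilds the toy data; `df_scaled_frame` is an illustrative name.

```python
import pandas
from sklearn.preprocessing import StandardScaler

df = pandas.DataFrame({"age": [23, 22, 31, 26, 22],
                       "weight": [70, 65, 80, 80, 65],
                       "height": [180, 160, 190, 175, 170]})

# fit_transform() is a shortcut for fit() followed by transform().
scaled = StandardScaler().fit_transform(df)

# Reattach the original column names and row index to the numpy array.
df_scaled_frame = pandas.DataFrame(scaled, columns=df.columns, index=df.index)
print(df_scaled_frame)
```

Recent `scikit-learn` releases also let transformers return dataframes directly through `set_output(transform="pandas")`; whether this is available depends on the installed version.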

Once the scaler has been fitted to the data, the `transform` method turns unscaled data into its scaled equivalent, while `inverse_transform` maps scaled data back to its unscaled representation:

In [5]:
print(scaler.inverse_transform(df_scaled))

[[ 23. 70. 180.]
 [ 22. 65. 160.]
 [ 31. 80. 190.]
 [ 26. 80. 175.]
 [ 22. 65. 170.]]
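
Because the fitted scaler stores the statistics it learned, it can also be applied to rows it has not seen during `fit`. The sketch below assumes a hypothetical new observation (`new_people` is made up for illustration) and scales it with the means and standard deviations of the original five rows.

```python
import numpy as np
import pandas
from sklearn.preprocessing import StandardScaler

df = pandas.DataFrame({"age": [23, 22, 31, 26, 22],
                       "weight": [70, 65, 80, 80, 65],
                       "height": [180, 160, 190, 175, 170]})

scaler = StandardScaler().fit(df)

# Hypothetical new observation, not part of the fitted data.
new_people = pandas.DataFrame({"age": [30], "weight": [72], "height": [185]})

# transform() reuses the statistics learned from df; it does not refit.
new_scaled = scaler.transform(new_people)
print(new_scaled)

# inverse_transform() recovers the original values.
print(scaler.inverse_transform(new_scaled))
```

Here the weight equals the fitted mean (72) and the height sits one fitted standard deviation above the mean (175 + 10), so these two scaled values come out as 0 and 1 up to floating-point error.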


Other scalers can be used in a similar manner:

In [6]:
scaler = MinMaxScaler()
scaler.fit(df)
df_scaled = scaler.transform(df)

print("Scaled data")
print(df_scaled)
print("\nMinimum values")
print(df_scaled.min(axis=0))
print("\nMaximum values")
print(df_scaled.max(axis=0))
print("\nInverse transforms")
print(scaler.inverse_transform(df_scaled))

Scaled data
[[ 0.11111111  0.33333333  0.66666667]
 [ 0.          0.          0.        ]
 [ 1.          1.          1.        ]
 [ 0.44444444  1.          0.5       ]
 [ 0.          0.          0.33333333]]

Minimum values
[ 0. 0. 0.]

Maximum values
[ 1. 1. 1.]

Inverse transforms
[[ 23. 70. 180.]
 [ 22. 65. 160.]
 [ 31. 80. 190.]
 [ 26. 80. 175.]
 [ 22. 65. 170.]]
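
The interval boundaries of `MinMaxScaler` can be changed with its `feature_range` parameter. As a small sketch on the same toy data, the target interval below is set to [-1, 1]; the choice of interval is arbitrary and only for illustration.

```python
import pandas
from sklearn.preprocessing import MinMaxScaler

df = pandas.DataFrame({"age": [23, 22, 31, 26, 22],
                       "weight": [70, 65, 80, 80, 65],
                       "height": [180, 160, 190, 175, 170]})

# Map each column's minimum to -1 and its maximum to 1.
scaler = MinMaxScaler(feature_range=(-1, 1))
df_scaled = scaler.fit_transform(df)

print(df_scaled)
print(df_scaled.min(axis=0))
print(df_scaled.max(axis=0))
```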


In [7]:
scaler = MaxAbsScaler()
scaler.fit(df)
df_scaled = scaler.transform(df)

print("Scaled data")
print(df_scaled)
print("\nMinimum values")
print(df_scaled.min(axis=0))
print("\nMaximum values")
print(df_scaled.max(axis=0))
print("\nInverse transforms")
print(scaler.inverse_transform(df_scaled))

Scaled data
[[ 0.74193548  0.875       0.94736842]
 [ 0.70967742  0.8125      0.84210526]
 [ 1.          1.          1.        ]
 [ 0.83870968  1.          0.92105263]
 [ 0.70967742  0.8125      0.89473684]]

Minimum values
[ 0.70967742 0.8125 0.84210526]

Maximum values
[ 1. 1. 1.]

Inverse transforms
[[ 23. 70. 180.]
 [ 22. 65. 160.]
 [ 31. 80. 190.]
 [ 26. 80. 175.]
 [ 22. 65. 170.]]
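
Since all values in this dataset are positive, the `MaxAbsScaler` output only covers part of the [0,1] interval. The full [-1,1] range appears once a variable takes negative values. The sketch below uses a made-up signed column (`temperature` is hypothetical and not part of the data above) to show that signs are preserved and that zero stays at zero, because the data is only divided, never shifted.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical signed measurements, one column.
temperature = np.array([[-10.0], [0.0], [5.0], [20.0], [-40.0]])

scaler = MaxAbsScaler()
scaled = scaler.fit_transform(temperature)

# The largest absolute value is 40, so every entry is divided by 40.
print(scaled.ravel())
print(scaler.max_abs_)
```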
