"
],
"text/plain": [
""
]
},
"execution_count": 6,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(x_miss).head().style.highlight_null(null_color='orange')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "QN3Yzt19ggGj"
},
"source": [
"### Imputation by the mean"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZuTTnjDngkUC"
},
"source": [
"The [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) class provides basic strategies for imputing missing values, such as imputation by the mean. This naive imputation serves as a benchmark in what follows. "
]
},
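As a minimal sketch of what mean imputation computes (plain NumPy, independent of scikit-learn; the helper name `mean_impute` is ours):

```python
import numpy as np

def mean_impute(x):
    """Replace each NaN by the mean of its column, computed on the
    observed entries only -- the same statistic that SimpleImputer's
    default strategy uses."""
    x = np.asarray(x, dtype=float).copy()
    col_means = np.nanmean(x, axis=0)      # per-column means, ignoring NaNs
    rows, cols = np.where(np.isnan(x))     # coordinates of the missing entries
    x[rows, cols] = col_means[cols]        # fill each hole with its column mean
    return x

x = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])
print(mean_impute(x))   # -> [[1. 6.], [3. 4.], [2. 8.]]
```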
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"id": "rE8Z9hUngap6"
},
"outputs": [],
"source": [
"x_mean = SimpleImputer().fit_transform(x_miss)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 145
},
"id": "DuKeiY9bgqQU",
"outputId": "4e8856ee-d76a-4ee1-85e7-2119bd967f6b"
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 6,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(x_mean).head().style.applymap(color_imputedvalues_orange, x_miss=x_miss)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JtkJbkPYhjbq"
},
"source": [
"### softimpute"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ylBM0_flhwH1"
},
"source": [
"The function softimpute (original article by [Hastie et al.](http://jmlr.org/papers/volume16/hastie15a/hastie15a.pdf)) can be used to impute quantitative data. The Python function coded here mimics the softimpute function of the [R package softImpute](https://cran.r-project.org/web/packages/softImpute/index.html). It fits a low-rank matrix approximation to a matrix with missing values via nuclear-norm regularization. The main arguments are the following. \n",
"\n",
"* `X`: the data set with missing values (matrix).\n",
"\n",
"* `lambda`: the nuclear-norm regularization parameter.\n",
"\n",
"To calibrate the parameter lambda, one may perform cross-validation, implemented in the function cv_softimpute, which takes as arguments the data set with missing values and the length of the grid on which cross-validation is performed."
]
},
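For intuition, here is a minimal NumPy sketch of the soft-thresholded-SVD iteration behind softImpute (this toy `soft_impute_sketch` is ours; it ignores the warm starts and convergence checks of the real implementation):

```python
import numpy as np

def soft_impute_sketch(x, lam, n_iter=200):
    """Alternate between (1) soft-thresholding the singular values by lam
    (the proximal step for the nuclear norm) and (2) restoring the observed
    entries, starting from zeros in the missing positions."""
    mask = np.isnan(x)
    z = np.where(mask, 0.0, x)
    for _ in range(n_iter):
        u, s, vt = np.linalg.svd(z, full_matrices=False)
        s = np.maximum(s - lam, 0.0)           # shrink the singular values
        z = np.where(mask, (u * s) @ vt, x)    # keep observed entries fixed
    return z

# Rank-1 matrix with one entry removed: the iteration recovers it closely.
truth = np.outer([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
x = truth.copy()
x[0, 0] = np.nan
print(soft_impute_sketch(x, lam=0.1)[0, 0])   # close to 1.0
```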
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"id": "hV6nTvcjh9p6"
},
"outputs": [],
"source": [
"cv_error, grid_lambda = cv_softimpute(x_miss, grid_len=15)\n",
"lbda = grid_lambda[np.argmin(cv_error)]\n",
"x_soft = softimpute((x_miss), lbda)[1]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 145
},
"id": "whQMfIJhiAVF",
"outputId": "64dfd84a-435c-4aa5-e3b4-a9ded99efb9b"
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 8,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(x_soft).head().style.applymap(color_imputedvalues_orange, x_miss=x_miss)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_NvD41DniCZk"
},
"source": [
"### Iterative chained equations"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "G0Ouw8ZuiIy_"
},
"source": [
"Iterative chained equations methods consist in (iterative) imputation using conditional expectations. The [IterativeImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer) class provides such methods and is inspired by the [mice](https://cran.r-project.org/web/packages/mice/index.html) package in R, but differs from it by returning a single imputation instead of multiple imputations. \n",
"\n",
"The main arguments are\n",
"\n",
"* `estimator`: the estimator to use for the imputation. By default, it is the BayesianRidge estimator, which performs a regularized linear regression.\n",
"\n",
"* `max_iter`: maximum number of imputation rounds to perform. (The last imputations are returned.)\n",
"\n",
"* `random_state`: seed of the pseudo random number generator to use.\n",
"\n",
"The `fit_transform` method fits the imputer on the incomplete matrix and returns the completed matrix. "
]
},
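The idea can be illustrated with a toy NumPy version of chained equations (our simplified `ice_sketch`, using plain least squares in place of BayesianRidge; note that in scikit-learn, `IterativeImputer` is experimental and must be enabled with `from sklearn.experimental import enable_iterative_imputer` before importing it):

```python
import numpy as np

def ice_sketch(x, n_iter=10):
    """Start from mean imputation, then cycle over the columns: refit a
    linear regression of each incomplete column on all the others and
    re-predict its missing entries."""
    x = np.asarray(x, dtype=float)
    mask = np.isnan(x)
    filled = np.where(mask, np.nanmean(x, axis=0), x)
    n, d = x.shape
    for _ in range(n_iter):
        for j in range(d):
            if not mask[:, j].any():
                continue
            a = np.column_stack([np.ones(n), np.delete(filled, j, axis=1)])
            obs = ~mask[:, j]
            coef, *_ = np.linalg.lstsq(a[obs], filled[obs, j], rcond=None)
            filled[mask[:, j], j] = a[mask[:, j]] @ coef
    return filled

# Column 1 is exactly twice column 0, so the missing entry is recovered.
x = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, np.nan], [4.0, 8.0]])
print(ice_sketch(x)[2, 1])   # 6.0
```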
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"id": "17Y8BsBUiLSw"
},
"outputs": [],
"source": [
"x_ice = IterativeImputer(random_state=0, max_iter=50).fit_transform(x_miss)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 145
},
"id": "HLr1FZ0QiNHN",
"outputId": "b84761f9-b596-4d8e-fc89-6c07c8213cee"
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 10,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(x_ice).head().style.applymap(color_imputedvalues_orange, x_miss=x_miss)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "u7GSf3jriPQa"
},
"source": [
"Another estimator can be used: the ExtraTreesRegressor estimator, which fits iterative random forests instead of iterative linear regressions and mimics the [missForest](https://cran.r-project.org/web/packages/missForest/missForest.pdf) package in R. [ExtraTreesRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html#sklearn.ensemble.ExtraTreesRegressor) fits a number of randomized extra-trees and averages the results. It comes from the module sklearn.ensemble. Its main arguments are the number of trees in the forest and the random state, which controls the sources of randomness. "
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"id": "BgflJeWQiQRZ"
},
"outputs": [],
"source": [
"estimator_rf = ExtraTreesRegressor(n_estimators=10, random_state=0)\n",
"x_rf = IterativeImputer(estimator=estimator_rf, random_state=0, max_iter=50).fit_transform(x_miss)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 145
},
"id": "dvIPgX1ViTsj",
"outputId": "54db216a-7ab4-4ead-969a-5283d658176e"
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 12,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(x_rf).head().style.applymap(color_imputedvalues_orange, x_miss=x_miss)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6fR1VVKjiW5m"
},
"source": [
"### Sinkhorn imputation"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XZ5bwrHoiYR2"
},
"source": [
"Sinkhorn imputation can be used to impute quantitative data. It relies on the idea that two batches extracted randomly from the same dataset should share the same distribution, and it imputes by minimizing optimal transport (OT) distances between batches. More details can be found in the original [article](https://arxiv.org/pdf/2002.03860.pdf) and the code is provided [here](https://github.com/BorisMuzellec/MissingDataOT).\n",
"\n",
"The main arguments are\n",
"\n",
"* `eps`: Sinkhorn regularization parameter. If the batch size is larger than half the dataset's size, it will be redefined in the imputation methods.\n",
"\n",
"* `lr`: learning rate.\n",
"\n",
"* `batchsize`: size of the batches on which the Sinkhorn divergence is evaluated.\n",
"\n",
"* `niter`: number of gradient updates for each model within a cycle.\n",
"\n",
"\n",
"To set the regularization parameter, one uses the function `pick_epsilon`, which sets it to a multiple of the median distance.\n",
"The `fit_transform` method fits the imputer on the incomplete matrix and returns the completed matrix. \n",
"\n"
]
},
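To make the objective concrete, here is a small NumPy implementation of the entropy-regularized OT cost between two batches via plain Sinkhorn iterations (our own sketch of the quantity being minimized, not the authors' code, which additionally backpropagates through these iterations to update the imputed entries):

```python
import numpy as np

def sinkhorn_cost(a_pts, b_pts, eps, n_iter=200):
    """Entropy-regularized transport cost between two uniformly weighted
    batches, computed with standard Sinkhorn updates on the scalings."""
    c = ((a_pts[:, None, :] - b_pts[None, :, :]) ** 2).sum(-1)  # squared dists
    k = np.exp(-c / eps)
    n, m = len(a_pts), len(b_pts)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):
        u = (1.0 / n) / (k @ v)       # match the row marginals (uniform)
        v = (1.0 / m) / (k.T @ u)     # match the column marginals (uniform)
    plan = u[:, None] * k * v[None, :]
    return float((plan * c).sum())

# Two identical batches are at (near-)zero transport cost.
batch = np.array([[0.0], [1.0], [2.0]])
print(sinkhorn_cost(batch, batch, eps=0.05))   # close to 0
```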
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"id": "G6ZT577jia1V"
},
"outputs": [],
"source": [
"X_true = torch.from_numpy(x_comp).double()\n",
"\n",
"eps = pick_epsilon(X_miss)\n",
"\n",
"sk_imputer = OTimputer(eps=eps, batchsize=128, lr=0.01, niter=15)\n",
"sk_imp, _, _ = sk_imputer.fit_transform(X_miss, X_true=X_true)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 145
},
"id": "nD8MI8Z2ie8Y",
"outputId": "263d9885-04c1-4101-e9a7-52730b31a9b7"
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 14,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"sk_imp_np = sk_imp.detach().numpy()\n",
"pd.DataFrame(sk_imp_np).head().style.applymap(color_imputedvalues_orange, x_miss=x_miss)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HuLYbqEHih8f"
},
"source": [
"### MIWAE"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "W1KOJgHc7M_8"
},
"source": [
"MIWAE imputes missing values with a deep latent variable model based on importance weighted variational inference. The original article is [here](http://proceedings.mlr.press/v97/mattei19a/mattei19a.pdf) and its code is available [here](https://github.com/pamattei/miwae). \n",
"\n",
"The main arguments are\n",
"\n",
"* `X_miss`: the data set with missing values (tensor).\n",
"\n",
"* `h`: number of hidden units in multi-layer\n",
"perceptrons (by default, $h=128$).\n",
"\n",
"* `d`: dimension of the latent space (by default $d=1$).\n",
"\n",
"* `K`: number of importance samples drawn during training (by default $K=20$).\n",
"\n",
"* `bs`: batch size (by default $bs=64$).\n",
"\n",
"* `n_epochs`: number of epochs (by default 2002).\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"id": "gO9zd-Va_zeO"
},
"outputs": [],
"source": [
"x_miwae = MIWAE(X_miss)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 145
},
"id": "OAlOfaiDw-28",
"outputId": "eee1088e-4f5a-4eb9-8110-ca7ee1c42ee9"
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 16,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(x_miwae).head().style.applymap(color_imputedvalues_orange, x_miss=x_miss)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Rv9ue9GNb9_3"
},
"source": [
"## Numerical experiments to compare the different methods"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lFWcpcJcb_Yg"
},
"source": [
"### Synthetic data\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0WsYkb0VcYH1"
},
"source": [
"\n",
"We compare the methods presented above for different percentages of missing values and for different missing-data mechanisms: \n",
"\n",
"* Missing Completely At Random (MCAR): the probability of being missing is the same for all observations.\n",
"\n",
"* Missing At Random (MAR): the probability of being missing only depends on observed values.\n",
"\n",
"* Missing Not At Random (MNAR): the probability of being missing depends on unobserved data, such as the value of the missing entry itself (and possibly on observed data as well).\n",
"\n",
"We compare the methods in terms of mean squared error (MSE), i.e.:\n",
"$$MSE(X^{imp}) = \\frac{1}{n_{NA}}\\sum_{i}\\sum_{j} 1_{X^{NA}_{ij}=NA}(X^{imp}_{ij} - X_{ij})^2$$\n",
"where $n_{NA} = \\sum_{i}\\sum_{j} 1_{X^{NA}_{ij}=NA}$ is the number of missing entries in $X^{NA}$.\n",
"\n",
"Note that in order to evaluate this error, we need to know the true values of the missing entries.\n"
]
},
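The error above is straightforward to compute once the missingness mask is known; here is a small NumPy helper matching the formula (our `mse_missing`, with `mask` True where an entry was missing):

```python
import numpy as np

def mse_missing(x_imp, x_true, mask):
    """Mean squared error restricted to the originally missing entries."""
    diff = (x_imp - x_true)[mask]
    return float((diff ** 2).mean())

x_true = np.array([[1.0, 2.0], [3.0, 4.0]])
x_imp  = np.array([[1.0, 2.5], [2.0, 4.0]])
mask   = np.array([[False, True], [True, False]])
print(mse_missing(x_imp, x_true, mask))   # (0.5**2 + 1.0**2) / 2 = 0.625
```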
{
"cell_type": "markdown",
"metadata": {
"id": "E36iXsYFcZs-"
},
"source": [
"The function **HowToImpute** compares the methods above with the naive imputation by the mean in terms of MSE on a complete dataset. More precisely, the function introduces missing values into the complete dataset with different percentages of missing values and missing-data mechanisms, and reports the MSE of each method for each missing-value setting. The final MSE for a given setting is obtained by aggregating the MSEs over several simulations, where the stochasticity comes from drawing the missing-data pattern several times.\n",
"\n",
"The arguments are the following. \n",
"\n",
"* `X`: the complete data set where the missing values will be introduced (matrix).\n",
"\n",
"* `perc_list`: list containing the different percentages of missing values. \n",
"\n",
"* `mecha_list`: list containing the different missing-data mechanisms (\"MCAR\", \"MAR\", \"MNAR\"). \n",
"\n",
"* `nbsim`: number of simulations performed. \n",
"\n",
"It returns a table containing the mean of the results for the simulations performed. "
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"id": "2AYLjE6wcjtP"
},
"outputs": [],
"source": [
"def HowToImpute(x_comp , perc_list , mecha_list , nbsim):\n",
" \"\"\"\n",
" Compare in terms of MSE several imputation methods for different percentages of missing values and missing-data mechanisms.\n",
" \n",
" Parameters\n",
" ----------\n",
" x_comp : the complete data set where the missing values will be introduced (numpy array).\n",
" perc_list : list containing the different percentages of missing values.\n",
" mecha_list : list containing the different missing-data mechanisms (\"MCAR\",\"MAR\" or \"MNAR\").\n",
" nbsim : number of simulations performed.\n",
" \n",
" Returns\n",
" -------\n",
" df: dataframe containing the mean of the MSEs for the simulations performed. \n",
" \"\"\"\n",
" mecha_perc_list = pd.DataFrame([(mecha,perc) for mecha, perc in product(mecha_list,perc_list)])\n",
" df = mecha_perc_list.apply(ComparMethods, axis=1, x_comp=x_comp, nbsim=nbsim)\n",
" df.index = mecha_perc_list.apply(lambda x : x[0] + \" \" + str(x[1]), axis=1)\n",
" \n",
" return df\n",
"\n",
"def ComparMethods(mecha_perc, x_comp, nbsim):\n",
" \"\"\"\n",
" Compare in terms of MSE several imputation methods for a given percentage of missing values and a given missing-data mechanism.\n",
" \n",
" Parameters\n",
" ----------\n",
" mecha_perc : list containing the missing-data mechanism and the percentage of missing values to be used for introducing missing values. \n",
" x_comp : the complete data set where the missing values will be introduced (matrix).\n",
" nbsim : number of simulations performed.\n",
" \n",
" Returns\n",
" -------\n",
" df: dataframe containing the mean of the MSEs.\n",
" \"\"\"\n",
" mecha = mecha_perc[0]\n",
" perc = mecha_perc[1]\n",
" \n",
" RMSE_results = pd.DataFrame()\n",
" Methods = ['mean', 'softimpute', 'ice', 'rf','sk','miwae']\n",
" for meth in Methods:\n",
" RMSE_results[meth]=[]\n",
" \n",
" for sim in range(0,nbsim):\n",
" ## Introduction NA\n",
" if mecha == \"MAR\":\n",
" XproduceNA = produce_NA(x_comp, perc, mecha, p_obs=0.5)\n",
" elif mecha == \"MNAR\":\n",
" XproduceNA = produce_NA(x_comp, perc, mecha, p_obs=0.5, opt=\"logistic\")\n",
" else: \n",
" XproduceNA = produce_NA(x_comp, perc, mecha)\n",
" mask = XproduceNA['mask'].numpy()\n",
" x_miss = XproduceNA['X_incomp'].numpy()\n",
" \n",
" ## Mean\n",
" x_mean = SimpleImputer().fit_transform(x_miss)\n",
" rmse_mean = RMSE(x_mean, x_comp, mask)\n",
"\n",
" ## SoftImpute\n",
" cv_error, grid_lambda = cv_softimpute(x_miss, grid_len=15)\n",
" lbda = grid_lambda[np.argmin(cv_error)]\n",
" x_soft = softimpute((x_miss), lbda)[1]\n",
" rmse_soft = RMSE(x_soft, x_comp, mask)\n",
"\n",
" ## Ice\n",
" x_ice = IterativeImputer(random_state=0, max_iter=50).fit_transform(x_miss)\n",
" rmse_ice = RMSE(x_ice, x_comp, mask)\n",
"\n",
" ## Random Forests\n",
" estimator_rf = ExtraTreesRegressor(n_estimators=10, random_state=0)\n",
" x_rf = IterativeImputer(estimator=estimator_rf, random_state=0, max_iter=50).fit_transform(x_miss)\n",
" rmse_rf = RMSE(x_rf, x_comp, mask)\n",
" \n",
" ## Sinkhorn imputation\n",
" X_true = torch.from_numpy(x_comp).double()\n",
" X_miss = XproduceNA['X_incomp']\n",
" batchsize = 128\n",
" lr = 1e-2\n",
" epsilon = pick_epsilon(X_miss)\n",
" sk_imputer = OTimputer(eps=epsilon, batchsize=batchsize, lr=lr, niter=2000)\n",
" sk_imp, _, _ = sk_imputer.fit_transform(X_miss, verbose=True, report_interval=500, X_true=X_true)\n",
" rmse_sk_imp = RMSE(sk_imp.detach().numpy(), x_comp, mask)\n",
"\n",
" ## MIWAE \n",
" x_miwae = MIWAE(X_miss)\n",
" rmse_miwae = RMSE(x_miwae, x_comp, mask)\n",
" \n",
" new_rmse = {'mean': rmse_mean, 'softimpute': rmse_soft, 'ice': rmse_ice, 'rf': rmse_rf, 'sk': rmse_sk_imp, 'miwae': rmse_miwae}\n",
" RMSE_results = pd.concat([RMSE_results, pd.DataFrame([new_rmse])], ignore_index=True)\n",
"\n",
" \n",
" return RMSE_results.mean()"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"id": "3rMVUBU1erzx"
},
"outputs": [],
"source": [
"perc_list = [0.1, 0.3, 0.5]\n",
"mecha_list = [\"MCAR\", \"MAR\", \"MNAR\"]\n",
"\n",
"results_how_to_impute = HowToImpute(x_comp, perc_list , mecha_list , nbsim=2)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 328
},
"id": "II4DmN2Dwomi",
"outputId": "2e71aab9-a7de-4245-fae3-1c644f961736"
},
"outputs": [
{
"data": {
"text/html": [
""
]
},
"metadata": {
"needs_background": "light",
"tags": []
},
"output_type": "display_data"
}
],
"source": [
"ax = results_how_to_impute.plot(kind=\"bar\",rot=30)\n",
"ax.get_legend().set_bbox_to_anchor((1, 1))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "87EDi7q22PTV"
},
"source": [
"### Real datasets"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "K-WiO6er2Zff"
},
"source": [
"We will now compare the methods on real complete data sets taken from [the UCI repository](http://archive.ics.uci.edu/ml), in which we will introduce missing values. In the present workflow, we propose a selection of several data sets (here, the data sets contain only quantitative variables):\n",
"\n",
"- Wine Quality - Red (1599x11)\n",
"- Wine Quality - White (4898x11)\n",
"- Slump (103x9)\n",
"\n",
"But you can test the methods on any complete dataset you want.\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"id": "rDbXz_6J3i0l"
},
"outputs": [],
"source": [
"if not os.path.isdir('datasets'):\n",
" os.mkdir('datasets')\n",
"wine_red = dataset_loader('wine_quality_red')\n",
"wine_white = dataset_loader('wine_quality_white')\n",
"slump = dataset_loader('concrete_slump')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "M7BjJgDc7UWP"
},
"source": [
"You can choose to scale the data prior to running the experiments, which implies that the variables have the same weight in the analysis. Scaling may easily be performed on complete data sets but is more difficult for incomplete data sets. (For MCAR values, the estimates of the standard deviation can be unbiased; however, for MNAR values, the estimators will suffer from biases.)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"id": "DBo48Eh_9CPf"
},
"outputs": [],
"source": [
"sc = True\n",
"if sc:\n",
" wine_white = scale(wine_white)\n",
" wine_red = scale(wine_red)\n",
" slump = scale(slump)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"id": "xgARc2hUCS7u"
},
"outputs": [],
"source": [
"datasets_list = dict(wine_white=wine_white, wine_red=wine_red, slump=slump)\n",
"names_dataset = ['wine_white','wine_red','slump']\n",
"perc = [0.1]\n",
"mecha = [\"MCAR\"]\n",
"nbsim = 2"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yq9blkHn9fVH"
},
"source": [
"We can then apply the **HowToImpute_real** function. It compares in terms of MSE several imputation methods for different complete datasets where missing values are introduced with a given percentage of missing values and a given missing-data mechanism.\n",
"\n",
"The arguments are the following. \n",
"\n",
"* `datasets_list`: dictionary of complete datasets.\n",
"\n",
"* `perc`: percentage of missing values.\n",
"\n",
"* `mecha`: missing-data mechanism (\"MCAR\",\"MAR\" or \"MNAR\"). \n",
"\n",
"* `nbsim`: number of simulations performed. \n",
"\n",
"* `names_dataset`: list containing the names of the datasets (for plotting results).\n",
"\n",
"It returns a table containing the mean of the MSEs for the simulations performed. "
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"id": "upBvUBHV9gZQ"
},
"outputs": [],
"source": [
"def HowToImpute_real(datasets_list, perc, mecha, nbsim, names_dataset):\n",
" \"\"\"\n",
" Compare in terms of MSE several imputation methods for different complete datasets where missing values are introduced with a given percentage of missing values and a given missing-data mechanism.\n",
" \n",
" Parameters\n",
" ----------\n",
" datasets_list : dictionary of complete datasets.\n",
" perc : percentage of missing values.\n",
" mecha : missing-data mechanism (\"MCAR\", \"MAR\" or \"MNAR\").\n",
" nbsim : number of simulations performed.\n",
" names_dataset : vector of the names of datasets.\n",
" \n",
" Returns\n",
" -------\n",
" res: dataframe containing the mean of the MSEs for the simulations performed. \n",
" \"\"\"\n",
"\n",
" for dat in range(0,len(datasets_list)):\n",
" df = HowToImpute(datasets_list[names_dataset[dat]], perc, mecha, nbsim)\n",
" if dat==0:\n",
" res = df\n",
" else:\n",
" res = pd.concat([res,df])\n",
" res.index = names_dataset\n",
" return res"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"id": "G39weYQzE8dV"
},
"outputs": [],
"source": [
"results_how_to_impute_real = HowToImpute_real(datasets_list, perc, mecha, nbsim, names_dataset)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 142
},
"id": "F1_-wvJpcUNU",
"outputId": "cb8c9441-ed74-4d77-c8aa-e126fa3af74e"
},
"outputs": [
{
"data": {
"text/html": [
"