{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " \n", "## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course \n", "###
Author: Archit Rungta\n", " \n", "##
Tutorial\n", "##
Imputing missing data with fancyimpute" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hi folks!\n", "\n", "In real-world data analysis, we often run into the problem of missing data. This can happen for many reasons, such as:\n", " - The data was compiled from different sources or at different times\n", " - The data was corrupted during storage\n", " - Certain fields were optional\n", " - etc.\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook has the following sections:\n", " 1. Introduction\n", " 2. The Problem\n", " 3. KNN Imputation\n", " 4. Comparison And Application\n", " 5. Summary\n", " 6. Further Reading" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial, we look at the problem of missing data in data analysis. First, we categorize the different types of missing data and briefly discuss the issue each type presents. Then, we look at various methods of imputing missing data and compare them on a real-world dataset using logistic regression. We also examine the validity of a commonly held assumption about imputation techniques." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##
Introduction\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Broadly, missing data is classified into three categories:\n", " - Missing Completely At Random (MCAR)\n", " > Values in a data set are missing completely at random (MCAR) if the events that lead to any particular data-item being missing are independent both of observable variables and of unobservable parameters of interest, and occur entirely at random\n", " - Missing At Random (MAR)\n", " > Missing at random (MAR) occurs when the missingness is not random, but where missingness can be fully accounted for by variables where there is complete information\n", " - Missing Not At Random (MNAR)\n", " > Missing not at random (MNAR) (also known as nonignorable nonresponse) is data that is neither MAR nor MCAR\n", " \n", "Data compiled from different sources is an example of MAR, while data corruption is an example of MCAR. MNAR is not a problem we can fix with imputation, because the non-response is **non-ignorable**. The only things we can do about MNAR are to gather more information from other sources or to ignore the affected data altogether. We therefore do not discuss MNAR any further in this tutorial.\n", "\n", "Strictly speaking, all of the techniques that follow are applicable only to MCAR data. In real-world scenarios, however, MAR is more common. We will therefore treat MAR data as if it were MCAR, which gives a reasonably good approximation in practice." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "##
The Problem\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start with a toy example, \n", "\n", "\\begin{align}\n", "\\ y & = \\sin(x) x\\, \\text{for $|x|<=6$}\n", "\\end{align}\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt # plots\n", "import numpy as np # vectors and matrices\n", "import pandas as pd # tables and data manipulations\n", "import seaborn as sns # more plots\n", "\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXYAAAD8CAYAAABjAo9vAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAE6pJREFUeJzt3X2MHVd9xvHnwSzN8rqpYpRm7cXQBrcEp5guaVHUFyCQlFJjXFWCqhSVStuigkIFgThWC1KFHOGWFwlEZUFaVU2hFTgOglCTFFQkVFLWcYgTjClChHgdhFFZQMqK2Mmvf9y7YTe+L7t3zt6Zc+b7kSJl772ZOXNn5snc35wzxxEhAEA5nlB3AwAAaRHsAFAYgh0ACkOwA0BhCHYAKAzBDgCFIdgBoDAEOwAUhmAHgMI8sY6VXnTRRbFt27Y6Vg0A2Tp69OgPImLzsM/VEuzbtm3T/Px8HasGgGzZvn8tn6tcirG91fYXbZ+wfZ/ta6suEwAwuhRX7OckvS0i7rL9NElHbd8eEV9PsGwAwDpVvmKPiAcj4q7uv/9E0glJ01WXCwAYTdJeMba3Sdop6c4e783Znrc9f+bMmZSrBQCskCzYbT9V0qckvTUifvz49yPiYETMRsTs5s1Db+oCAEaUJNhtT6gT6jdHxKEUywQAjKbyzVPblvQxSSci4n3Vm4QcHD62oANHTur04pIumZrUdVdv1+6d3FppEvZRe6XoFXOlpNdLOm777u5rN0TEbQmWjQY6fGxBew8d19LZRyRJC4tL2nvo+GPvEybj1SvAJfXdR+yP8rmOOU9nZ2eDAUr5uvLGL2hhcem816cmJ/TTc48+FiaSNDmxSfv37CBMNsjj/ycrdb7zCyaeoB8+dPa8z09PTerL1790nE1EQraPRsTssM/VMvIUeej3U/50j1CXpMWl84Nk6ewjOnDkJMG+QQ4cObkq1KXOd/7415adXlyiRNMCBDt6GlRuuWRqsucVez/9/keA6tb73T5jcoISTQvwdEf01O9K8MCRk7ru6u2anNi06r3JiU268MkTPZd1ydTkhrWz7fp9t1OTEz33ka2++xXl4IodPfW7Ejy9uPTYld2wG3ZSJ0yuu3o7P/8T6PUdXnf19p7f+bt3XSbp/H30V/92d89l86uqLAQ7eupXblm+Qty9c7pvMNNDI71+pbH9e3Zo/54dff+n+fjv98CRkwP3K8pArxj01K+3xSg9XPr1oqGHxtql+g5T7leMH71i
UEm/cssoJ/+gsg7WJtV3mHK/orkIdvQ1qNyyHsPKOhgu5XeYar+iuegVgw3XrxfNcv0dw/EdYj24YseG91jh53914/gO6blUDm6ethw30yBxHORirTdPKcW03KCBSGgPjoOyUIppubp7rPDzf7W6vo+6jwOkxRV7y/XrVTGOHivLP/8XFpcU+tmgm8PHFjZ83U1U5/dR53GA9Aj2lquztwU//1er8/ug101ZKMW0XJ09Vvj5v1qd3wc9l8pCsKO2ASsMXFqt7u+DgUvloBSD2vDzfzW+D6TCFTtqw8//1fg+kAoDlAAgE2MdoGT7Jtvft31viuUBAEaXqhTzT5I+JOmfEy0PLVf6wKXStw/1ShLsEfEl29tSLAsYNJF2CeFX+vahfvSKQeOUPnCp9O1D/cbWK8b2nKQ5SZqZmRnXarFCLj//Sx+4lOP25XLsoGNsV+wRcTAiZiNidvPmzeNaLbpyei5L6c8tyW37cjp20EEppiVy+vlf+kCd3LYvp2MHHam6O35c0n9L2m77lO0/S7FcpJPTz//dO6e1f88OTU9NypKmpyaLmvAht+3L6dhBR6peMa9LsRxsnLqfQ7JepT+3JKfty+3YAaWY1sjt5z+ag2MnPzwrpiV4DglGxbGTH54Vg+zk1PUup7ai+db6rBiu2JGVnEZt5tRWlIUaO7KSU9e7nNqKshDsyEpOXe9yaivKQrAjKzmN2syprSgLwY6s5NT1Lqe2oizcPEVWcup6l1NbURa6OwJAJsY6NR4AoDkoxaAYdQ4GYiASmoRgRxHqHAzEQCQ0DaUYFKHOwUAMRELTcMVeoDaWBeocDNTmgUhtPNZywBV7Ydo6jVmdg4HaOhCprcdaDgj2wrS1LFDnYKC2DkRq67GWA0oxhWlrWaDOwUBtHYjU1mMtBwR7Ydo8jVm/6eZS1oH7LSunqe5SafOx1nSUYgrT1rJAPynrwNSUV+NYay6CvTC7d05r/54dmp6alCVNT01q/54drbuaXJayDkxNeTWOteZKUoqxfY2kD0raJOmjEXFjiuViNG0sC/QzqA7cr6zS73VqyufjWGumysFue5OkD0t6uaRTkr5q+9MR8fWqywaq6lcHfsbkRM/RovP3/58+dXSh5yhSasrIRYpSzBWSvhUR346IhyV9QtKrEywXqKxfHdhWz7LKx+98oG+5hZoycpEi2KclPbDi71Pd14Da9asDLz50tufnH+nzGOvTi0vUlJGNFDV293jtvLPD9pykOUmamZlJsFpgbXrVgQ8cOdmzrLLJ7hnuy+UWasrIQYor9lOStq74e4uk04//UEQcjIjZiJjdvHlzgtUCo+tXVnndr2+l3ILspbhi/6qkS20/W9KCpNdK+qMEywU2zKDRorPP+vnWjSJFWZJMjWf7lZI+oE53x5si4j2DPs/UeACwfmudGi9JP/aIuE3SbSmWBQCohpGnAFAYgh0ACkOwA0BhCHYAKAzBDgCFIdgBoDAEOwAUhmAHgMIw52nGUs7lCaTEsVkvgj1Ty/Nv9poQghMIdeLYrB+lmEwx/yaaimOzfgR7pph/E03FsVk/gj1T/ebZZP5N1I1js34Ee6aYfxNNxbFZP26eZmrQRBFAnTg265dkoo31YqINAFi/tU60QSkGAApDsANAYQh2ACgMwQ4AhSHYAaAwlYLd9h/avs/2o7aH3qkFAGy8qlfs90raI+lLCdoCAEig0gCliDghSbbTtAYAUBk1dgAozNArdtt3SLq4x1v7IuLWta7I9pykOUmamZlZcwMBAOszNNgj4qoUK4qIg5IOSp1HCqRYJgDgfJRiAKAwVbs7vsb2KUkvlvRZ20fSNAsAMKqqvWJukXRLorYAABKgFAMAhSHYAaAwBDsAFIZgB4DCMOdpwx0+tsDckSgGx/N4EOwNdvjYgvYeOq6ls49IkhYWl7T30HFJ4mRAdjiex4dSTIMdOHLysZNg2dLZR3TgyMmaWgSMjuN5fAj2Bju9uLSu14Em43ge
H4K9wS6ZmlzX60CTcTyPD8HeYNddvV2TE5tWvTY5sUnXXb29phYBo+N4Hh9unjbY8g0lehGgBBzP4+OI8T9Bd3Z2Nubn58e+XgDIme2jETF0fulsrtjp/wogV+POryyCnf6vAHJVR35lcfOU/q8AclVHfmUR7PR/BZCrOvIri2Cn/yuAXNWRX1kEO/1fAeSqjvzK4uYp/V8B5KqO/KIfOwBkYq392LMoxQAA1q5SsNs+YPsbtu+xfYvtqVQNAwCMpuoV++2Snh8Rl0v6pqS91ZsEAKiiUrBHxOcj4lz3z69I2lK9SQCAKlLW2N8o6XP93rQ9Z3ve9vyZM2cSrhYAsNLQ7o6275B0cY+39kXErd3P7JN0TtLN/ZYTEQclHZQ6vWJGai0AYKihwR4RVw163/YbJL1K0suijr6TAIBVKg1Qsn2NpHdK+u2IeChNkwAAVVStsX9I0tMk3W77btv/kKBNAIAKKl2xR8QvpWoIACANRp4CQGGyeAhYGzD1H9qM4z8tgr0BmPoPbcbxnx6lmAZg6j+0Gcd/egR7AzD1H9qM4z89gr0BmPoPbcbxnx7B3gBM/Yc24/hPj5unDcDUf2gzjv/0mBoPADLB1HgA0FIEOwAUhmAHgMIQ7ABQGIIdAApDsANAYQh2ACgMwQ4AhSHYAaAwRTxSgIf0A2iCpmRR9sHOQ/oBNEGTsij7UgwP6QfQBE3KokrBbvtvbd9j+27bn7d9SaqGrRUP6QfQBE3KoqpX7Aci4vKIeIGkz0j6mwRtWhce0g+gCZqURZWCPSJ+vOLPp0ga+zOAeUg/gCZoUhZVvnlq+z2S/kTSjyS9ZMDn5iTNSdLMzEzV1T6Gh/QDaIImZdHQiTZs3yHp4h5v7YuIW1d8bq+kCyLiXcNWykQbALB+a51oY+gVe0RctcZ1/qukz0oaGuxt1pR+rkAuOGfWr1IpxvalEfG/3T93SfpG9SaVq0n9XIEccM6MpmqvmBtt32v7HkmvkHRtgjYVq0n9XIEccM6MptIVe0T8QaqGtEGT+rkCOeCcGU32I09z0qR+rkAOOGdGQ7CPUZP6uQI54JwZTfYPActJk/q5AjngnBnN0H7sG4F+7ACwfmvtx04pBgAKQ7ADQGEIdgAoDMEOAIUh2AGgMAQ7ABSGYAeAwhDsAFAYgh0ACkOwA0BhCHYAKAzBDgCFIdgBoDBFP7aXSXABbISmZ0uxwV73JLhN3/FA7uo6x+rOlrUothRT5yS4yzt+YXFJoZ/t+MPHFjZ83UAb1HmO5TDBdpJgt/1222H7ohTLS6HOSXBz2PFAzuo8x3KYYLtysNveKunlkr5bvTnp1DkJbg47HshZnedYDhNsp7hif7+kd0ga/xx7A9Q5CW4OOx7IWZ3nWA4TbFcKdtu7JC1ExNcStSeZ3TuntX/PDk1PTcqSpqcmtX/PjrHc3MhhxwM5q/McqzNb1mroZNa275B0cY+39km6QdIrIuJHtr8jaTYiftBnOXOS5iRpZmbm1+6///4q7W48esUAG6uN59haJ7MeGuwDVrBD0n9Keqj70hZJpyVdERHfG/Tfzs7Oxvz8/EjrBYC2Wmuwj9yPPSKOS3rmihV+RwOu2AEA41FsP3YAaKtkI08jYluqZQEARscVOwAUhmAHgMIQ7ABQGIIdAApDsANAYQh2AChMsRNtDJJyKHIbhzUDTcb53cJgTzn7SQ4zqQBtwvnd0bpSTMoH9DOhBtAsnN8drQv2lA/oZ0INoFk4vztaF+wpH9DPhBpAs3B+d7Qu2FM+oJ8JNYBm4fzuaN3N0+WbHinudKdcFoDqOL87Rp5oowom2gCA9VvrRButK8UAQOlaV4oZRa6DFAD8TJvOY4J9hV47XlK2gxQAdAwabCTlWUcfhBp71+N3vNS5A37BxBP0w4fOnvf56alJffn6l46ziQBGdOWNX9BCj/7nU5MT+um5R8877/fv2dHIcKfGvk79Rpn1CnUpj0EKADr6na+LS2ezHV06CMHe
td6gzmGQAoCO9Z6vuV+4Eexd/Xb81OREtoMUAHT0G2x04ZMnen4+9wu3SsFu+922F2zf3f3nlakaNm79dvy7d12m/Xt2aHpqUlantt7U+huA3nbvnO55Hr/r9y8r8sItRa+Y90fE3yVYTq2GjTIjyIG87d453fc8Lq1XDN0dVxi04wGUqcTzPkWN/c2277F9k+0LEywPAFDB0GC3fYfte3v882pJH5H0i5JeIOlBSX8/YDlztudtz585cybZBgAAVks2QMn2NkmfiYjnD/tsEwcoAUDTjWWAku1fWPHnayTdW2V5AIDqqt48fa/tF0gKSd+R9OeVWwQAqKSWZ8XYPiPp/hH/84sk/SBhc+rEtjRPKdshsS1NVHU7nhURm4d9qJZgr8L2/FpqTDlgW5qnlO2Q2JYmGtd28EgBACgMwQ4Ahckx2A/W3YCE2JbmKWU7JLalicayHdnV2AEAg+V4xQ4AGCDbYLf9Ftsnbd9n+711t6cq22+3HbYvqrsto7B9wPY3us8NusX2VN1tWi/b13SPqW/Zvr7u9ozK9lbbX7R9ont+XFt3m6qwvcn2MdufqbstVdiesv3J7nlywvaLN2pdWQa77ZdIerWkyyPiMklZPzbY9lZJL5f03brbUsHtkp4fEZdL+qakvTW3Z11sb5L0YUm/K+l5kl5n+3n1tmpk5yS9LSJ+RdJvSPrLjLdFkq6VdKLuRiTwQUn/ERG/LOlXtYHblGWwS3qTpBsj4qeSFBHfr7k9Vb1f0jvUGcGbpYj4fESc6/75FUlb6mzPCK6Q9K2I+HZEPCzpE+pcPGQnIh6MiLu6//4TdQIky+fS2t4i6fckfbTutlRh++mSfkvSxyQpIh6OiMWNWl+uwf5cSb9p+07b/2X7RXU3aFS2d0laiIiv1d2WhN4o6XN1N2KdpiU9sOLvU8o0DFfqPpxvp6Q7623JyD6gzkXPo3U3pKLnSDoj6R+7ZaWP2n7KRq2ssRNt2L5D0sU93tqnTrsvVOdn5osk/bvt50RDu/gM2ZYbJL1ivC0azaDtiIhbu5/Zp04p4OZxti0B93itkcfTWtl+qqRPSXprRPy47vasl+1XSfp+RBy1/Tt1t6eiJ0p6oaS3RMSdtj8o6XpJf71RK2ukiLiq33u23yTpUDfI/8f2o+o8g6GRD3rvty22d0h6tqSv2ZY65Yu7bF8REd8bYxPXZNA+kSTbb5D0Kkkva+r/ZAc4JWnrir+3SDpdU1sqsz2hTqjfHBGH6m7PiK6UtKs7l/IFkp5u+18i4o9rbtcoTkk6FRHLv5w+qU6wb4hcSzGHJb1Ukmw/V9KTlOEDgiLieEQ8MyK2RcQ2dXb+C5sY6sPYvkbSOyXtioiH6m7PCL4q6VLbz7b9JEmvlfTpmts0EneuEj4m6UREvK/u9owqIvZGxJbuufFaSV/INNTVPacfsL08S/bLJH19o9bX2Cv2IW6SdJPteyU9LOkNGV4hluZDkn5O0u3dXx9fiYi/qLdJaxcR52y/WdIRSZsk3RQR99XcrFFdKen1ko7bvrv72g0RcVuNbYL0Fkk3dy8cvi3pTzdqRYw8BYDC5FqKAQD0QbADQGEIdgAoDMEOAIUh2AGgMAQ7ABSGYAeAwhDsAFCY/wfeeXg9qnuqywAAAABJRU5ErkJggg==\n", "text/plain": [ "