{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# No Shows By Chase Kregor<a name='start' />\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Notebook Bookmarks:\n",
    "\n",
    "### - Go to <a href=#start>The Top of The Notebook</a>\n",
    "### - Go to <a href=#cleanandeda>Clean and EDA</a>\n",
    "### - Go to <a href=#featureselectionandcreations>Feature Selection and Creation</a>\n",
    "### - Go to <a href=#trainmodels>Train Models</a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "#import matplotlib.style as style\n",
    "#style.use('fivethirtyeight')\n",
    "import seaborn as sb"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Start of Clean and EDA <a name='cleanandeda' />"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PatientId</th>\n",
       "      <th>AppointmentID</th>\n",
       "      <th>Gender</th>\n",
       "      <th>ScheduledDay</th>\n",
       "      <th>AppointmentDay</th>\n",
       "      <th>Age</th>\n",
       "      <th>Neighbourhood</th>\n",
       "      <th>Scholarship</th>\n",
       "      <th>Hipertension</th>\n",
       "      <th>Diabetes</th>\n",
       "      <th>Alcoholism</th>\n",
       "      <th>Handcap</th>\n",
       "      <th>SMS_received</th>\n",
       "      <th>No-show</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2.987250e+13</td>\n",
       "      <td>5642903</td>\n",
       "      <td>F</td>\n",
       "      <td>2016-04-29T18:38:08Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>62</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5.589978e+14</td>\n",
       "      <td>5642503</td>\n",
       "      <td>M</td>\n",
       "      <td>2016-04-29T16:08:27Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>56</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.262962e+12</td>\n",
       "      <td>5642549</td>\n",
       "      <td>F</td>\n",
       "      <td>2016-04-29T16:19:04Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>62</td>\n",
       "      <td>MATA DA PRAIA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>8.679512e+11</td>\n",
       "      <td>5642828</td>\n",
       "      <td>F</td>\n",
       "      <td>2016-04-29T17:29:31Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>8</td>\n",
       "      <td>PONTAL DE CAMBURI</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>8.841186e+12</td>\n",
       "      <td>5642494</td>\n",
       "      <td>F</td>\n",
       "      <td>2016-04-29T16:07:23Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>56</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      PatientId  AppointmentID Gender          ScheduledDay  \\\n",
       "0  2.987250e+13        5642903      F  2016-04-29T18:38:08Z   \n",
       "1  5.589978e+14        5642503      M  2016-04-29T16:08:27Z   \n",
       "2  4.262962e+12        5642549      F  2016-04-29T16:19:04Z   \n",
       "3  8.679512e+11        5642828      F  2016-04-29T17:29:31Z   \n",
       "4  8.841186e+12        5642494      F  2016-04-29T16:07:23Z   \n",
       "\n",
       "         AppointmentDay  Age      Neighbourhood  Scholarship  Hipertension  \\\n",
       "0  2016-04-29T00:00:00Z   62    JARDIM DA PENHA            0             1   \n",
       "1  2016-04-29T00:00:00Z   56    JARDIM DA PENHA            0             0   \n",
       "2  2016-04-29T00:00:00Z   62      MATA DA PRAIA            0             0   \n",
       "3  2016-04-29T00:00:00Z    8  PONTAL DE CAMBURI            0             0   \n",
       "4  2016-04-29T00:00:00Z   56    JARDIM DA PENHA            0             1   \n",
       "\n",
       "   Diabetes  Alcoholism  Handcap  SMS_received No-show  \n",
       "0         0           0        0             0      No  \n",
       "1         0           0        0             0      No  \n",
       "2         0           0        0             0      No  \n",
       "3         0           0        0             0      No  \n",
       "4         1           0        0             0      No  "
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = pd.read_csv(\"../data/KaggleV2-May-2016.csv\")\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PatientID</th>\n",
       "      <th>AppointmentID</th>\n",
       "      <th>Gender</th>\n",
       "      <th>ScheduledDay</th>\n",
       "      <th>AppointmentDay</th>\n",
       "      <th>Age</th>\n",
       "      <th>Neighbourhood</th>\n",
       "      <th>Scholarship</th>\n",
       "      <th>Hypertension</th>\n",
       "      <th>Diabetes</th>\n",
       "      <th>Alcoholism</th>\n",
       "      <th>Handicap</th>\n",
       "      <th>SMS_received</th>\n",
       "      <th>NoShow</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2.987250e+13</td>\n",
       "      <td>5642903</td>\n",
       "      <td>F</td>\n",
       "      <td>2016-04-29T18:38:08Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>62</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5.589978e+14</td>\n",
       "      <td>5642503</td>\n",
       "      <td>M</td>\n",
       "      <td>2016-04-29T16:08:27Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>56</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.262962e+12</td>\n",
       "      <td>5642549</td>\n",
       "      <td>F</td>\n",
       "      <td>2016-04-29T16:19:04Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>62</td>\n",
       "      <td>MATA DA PRAIA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>8.679512e+11</td>\n",
       "      <td>5642828</td>\n",
       "      <td>F</td>\n",
       "      <td>2016-04-29T17:29:31Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>8</td>\n",
       "      <td>PONTAL DE CAMBURI</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>8.841186e+12</td>\n",
       "      <td>5642494</td>\n",
       "      <td>F</td>\n",
       "      <td>2016-04-29T16:07:23Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>56</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      PatientID  AppointmentID Gender          ScheduledDay  \\\n",
       "0  2.987250e+13        5642903      F  2016-04-29T18:38:08Z   \n",
       "1  5.589978e+14        5642503      M  2016-04-29T16:08:27Z   \n",
       "2  4.262962e+12        5642549      F  2016-04-29T16:19:04Z   \n",
       "3  8.679512e+11        5642828      F  2016-04-29T17:29:31Z   \n",
       "4  8.841186e+12        5642494      F  2016-04-29T16:07:23Z   \n",
       "\n",
       "         AppointmentDay  Age      Neighbourhood  Scholarship  Hypertension  \\\n",
       "0  2016-04-29T00:00:00Z   62    JARDIM DA PENHA            0             1   \n",
       "1  2016-04-29T00:00:00Z   56    JARDIM DA PENHA            0             0   \n",
       "2  2016-04-29T00:00:00Z   62      MATA DA PRAIA            0             0   \n",
       "3  2016-04-29T00:00:00Z    8  PONTAL DE CAMBURI            0             0   \n",
       "4  2016-04-29T00:00:00Z   56    JARDIM DA PENHA            0             1   \n",
       "\n",
       "   Diabetes  Alcoholism  Handicap  SMS_received NoShow  \n",
       "0         0           0         0             0     No  \n",
       "1         0           0         0             0     No  \n",
       "2         0           0         0             0     No  \n",
       "3         0           0         0             0     No  \n",
       "4         1           0         0             0     No  "
      ]
     },
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.rename(columns = {'Hipertension':'Hypertension',\n",
    "                       'PatientId': 'PatientID',\n",
    "                       'Handcap': 'Handicap',\n",
    "                       'No-show': 'NoShow',\n",
    "                       'Alcoholism':'Alcoholism'\n",
    "                         }, inplace = True)\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now trying to understand the data set. It's distributions and unique values. Also attempting to find funky and incorrect data points. I want to understand and check the integrity of the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "8.221459e+14    88\n",
       "9.963767e+10    84\n",
       "2.688613e+13    70\n",
       "3.353478e+13    65\n",
       "2.584244e+11    62\n",
       "7.579746e+13    62\n",
       "8.713749e+14    62\n",
       "6.264199e+12    62\n",
       "6.684488e+13    57\n",
       "8.722785e+11    55\n",
       "8.923969e+13    54\n",
       "8.435224e+09    51\n",
       "8.534397e+14    50\n",
       "1.447997e+13    46\n",
       "6.543360e+13    46\n",
       "8.189452e+13    42\n",
       "9.452745e+12    42\n",
       "1.882323e+14    40\n",
       "9.496197e+12    38\n",
       "2.271580e+12    38\n",
       "1.336493e+13    37\n",
       "1.484143e+12    35\n",
       "8.883500e+13    34\n",
       "9.861628e+14    34\n",
       "7.124589e+14    33\n",
       "4.167557e+14    30\n",
       "6.128878e+12    30\n",
       "8.121397e+13    29\n",
       "8.634164e+12    24\n",
       "3.699499e+13    23\n",
       "                ..\n",
       "6.375629e+12     1\n",
       "9.369127e+12     1\n",
       "5.375556e+14     1\n",
       "1.662184e+11     1\n",
       "7.234615e+13     1\n",
       "9.649990e+12     1\n",
       "6.912783e+10     1\n",
       "1.954265e+13     1\n",
       "2.736377e+10     1\n",
       "5.532694e+11     1\n",
       "7.149583e+12     1\n",
       "8.676752e+13     1\n",
       "7.838359e+13     1\n",
       "5.962625e+11     1\n",
       "4.919862e+13     1\n",
       "3.477350e+14     1\n",
       "1.626595e+13     1\n",
       "7.794917e+12     1\n",
       "1.161950e+13     1\n",
       "5.615364e+14     1\n",
       "4.355592e+11     1\n",
       "1.321328e+12     1\n",
       "1.751987e+13     1\n",
       "4.262579e+13     1\n",
       "3.115681e+13     1\n",
       "1.222828e+13     1\n",
       "6.821231e+11     1\n",
       "7.163981e+14     1\n",
       "9.798964e+14     1\n",
       "2.724571e+11     1\n",
       "Name: PatientID, Length: 62299, dtype: int64"
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.PatientID.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "making sure there arent duplicate appointment IDs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5769215    1\n",
       "5731652    1\n",
       "5707080    1\n",
       "5702986    1\n",
       "5715276    1\n",
       "5717325    1\n",
       "5711182    1\n",
       "5758289    1\n",
       "5762391    1\n",
       "5741913    1\n",
       "5483871    1\n",
       "5660001    1\n",
       "5653858    1\n",
       "5666148    1\n",
       "5668197    1\n",
       "5641576    1\n",
       "5639531    1\n",
       "5649772    1\n",
       "5645678    1\n",
       "5647727    1\n",
       "5692785    1\n",
       "5686642    1\n",
       "5694838    1\n",
       "5696887    1\n",
       "5674360    1\n",
       "5733701    1\n",
       "5651786    1\n",
       "5672315    1\n",
       "5719362    1\n",
       "5672187    1\n",
       "          ..\n",
       "5744033    1\n",
       "5748131    1\n",
       "5739943    1\n",
       "5672324    1\n",
       "5682563    1\n",
       "5680512    1\n",
       "5782866    1\n",
       "5496110    1\n",
       "5713200    1\n",
       "5711153    1\n",
       "5717298    1\n",
       "5709110    1\n",
       "5707063    1\n",
       "5729592    1\n",
       "5463358    1\n",
       "5565768    1\n",
       "5776721    1\n",
       "5789023    1\n",
       "5590396    1\n",
       "5606756    1\n",
       "5608807    1\n",
       "5635434    1\n",
       "5621101    1\n",
       "5686470    1\n",
       "5582192    1\n",
       "5586290    1\n",
       "5584243    1\n",
       "5598584    1\n",
       "5602682    1\n",
       "5771266    1\n",
       "Name: AppointmentID, Length: 110527, dtype: int64"
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.AppointmentID.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['F', 'M'], dtype=object)"
      ]
     },
     "execution_count": 69,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.Gender.unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "F    71840\n",
       "M    38687\n",
       "Name: Gender, dtype: int64"
      ]
     },
     "execution_count": 70,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.Gender.value_counts()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "trying to understand my timeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['2016-04-29T00:00:00Z', '2016-05-03T00:00:00Z',\n",
       "       '2016-05-10T00:00:00Z', '2016-05-17T00:00:00Z',\n",
       "       '2016-05-24T00:00:00Z', '2016-05-31T00:00:00Z',\n",
       "       '2016-05-02T00:00:00Z', '2016-05-30T00:00:00Z',\n",
       "       '2016-05-16T00:00:00Z', '2016-05-04T00:00:00Z',\n",
       "       '2016-05-19T00:00:00Z', '2016-05-12T00:00:00Z',\n",
       "       '2016-05-06T00:00:00Z', '2016-05-20T00:00:00Z',\n",
       "       '2016-05-05T00:00:00Z', '2016-05-13T00:00:00Z',\n",
       "       '2016-05-09T00:00:00Z', '2016-05-25T00:00:00Z',\n",
       "       '2016-05-11T00:00:00Z', '2016-05-18T00:00:00Z',\n",
       "       '2016-05-14T00:00:00Z', '2016-06-02T00:00:00Z',\n",
       "       '2016-06-03T00:00:00Z', '2016-06-06T00:00:00Z',\n",
       "       '2016-06-07T00:00:00Z', '2016-06-01T00:00:00Z',\n",
       "       '2016-06-08T00:00:00Z'], dtype=object)"
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.AppointmentDay.unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       " 0      3539\n",
       " 1      2273\n",
       " 52     1746\n",
       " 49     1652\n",
       " 53     1651\n",
       " 56     1635\n",
       " 38     1629\n",
       " 59     1624\n",
       " 2      1618\n",
       " 50     1613\n",
       " 57     1603\n",
       " 36     1580\n",
       " 51     1567\n",
       " 19     1545\n",
       " 39     1536\n",
       " 37     1533\n",
       " 54     1530\n",
       " 34     1526\n",
       " 33     1524\n",
       " 30     1521\n",
       " 6      1521\n",
       " 3      1513\n",
       " 17     1509\n",
       " 32     1505\n",
       " 5      1489\n",
       " 44     1487\n",
       " 18     1487\n",
       " 58     1469\n",
       " 46     1460\n",
       " 45     1453\n",
       "        ... \n",
       " 74      602\n",
       " 76      571\n",
       " 75      544\n",
       " 78      541\n",
       " 77      527\n",
       " 80      511\n",
       " 81      434\n",
       " 82      392\n",
       " 79      390\n",
       " 84      311\n",
       " 83      280\n",
       " 85      275\n",
       " 86      260\n",
       " 87      184\n",
       " 89      173\n",
       " 88      126\n",
       " 90      109\n",
       " 92       86\n",
       " 91       66\n",
       " 93       53\n",
       " 94       33\n",
       " 95       24\n",
       " 96       17\n",
       " 97       11\n",
       " 98        6\n",
       " 115       5\n",
       " 100       4\n",
       " 102       2\n",
       " 99        1\n",
       "-1         1\n",
       "Name: Age, Length: 104, dtype: int64"
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.Age.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "need to get rid of negative value. It is impossible for someone to be -1.  \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Applications/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame\n",
      "\n",
      "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n",
      "  \"\"\"Entry point for launching an IPython kernel.\n"
     ]
    }
   ],
   "source": [
    "data['Age'][data['Age'] < 0] = 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0      3539\n",
       "1      2274\n",
       "52     1746\n",
       "49     1652\n",
       "53     1651\n",
       "56     1635\n",
       "38     1629\n",
       "59     1624\n",
       "2      1618\n",
       "50     1613\n",
       "57     1603\n",
       "36     1580\n",
       "51     1567\n",
       "19     1545\n",
       "39     1536\n",
       "37     1533\n",
       "54     1530\n",
       "34     1526\n",
       "33     1524\n",
       "30     1521\n",
       "6      1521\n",
       "3      1513\n",
       "17     1509\n",
       "32     1505\n",
       "5      1489\n",
       "44     1487\n",
       "18     1487\n",
       "58     1469\n",
       "46     1460\n",
       "45     1453\n",
       "       ... \n",
       "72      615\n",
       "74      602\n",
       "76      571\n",
       "75      544\n",
       "78      541\n",
       "77      527\n",
       "80      511\n",
       "81      434\n",
       "82      392\n",
       "79      390\n",
       "84      311\n",
       "83      280\n",
       "85      275\n",
       "86      260\n",
       "87      184\n",
       "89      173\n",
       "88      126\n",
       "90      109\n",
       "92       86\n",
       "91       66\n",
       "93       53\n",
       "94       33\n",
       "95       24\n",
       "96       17\n",
       "97       11\n",
       "98        6\n",
       "115       5\n",
       "100       4\n",
       "102       2\n",
       "99        1\n",
       "Name: Age, Length: 103, dtype: int64"
      ]
     },
     "execution_count": 74,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.Age.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    99666\n",
       "1    10861\n",
       "Name: Scholarship, dtype: int64"
      ]
     },
     "execution_count": 75,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.Scholarship.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    88726\n",
       "1    21801\n",
       "Name: Hypertension, dtype: int64"
      ]
     },
     "execution_count": 76,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.Hypertension.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    102584\n",
       "1      7943\n",
       "Name: Diabetes, dtype: int64"
      ]
     },
     "execution_count": 77,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.Diabetes.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    108286\n",
       "1      2042\n",
       "2       183\n",
       "3        13\n",
       "4         3\n",
       "Name: Handicap, dtype: int64"
      ]
     },
     "execution_count": 78,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.Handicap.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    75045\n",
       "1    35482\n",
       "Name: SMS_received, dtype: int64"
      ]
     },
     "execution_count": 79,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.SMS_received.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "No     88208\n",
       "Yes    22319\n",
       "Name: NoShow, dtype: int64"
      ]
     },
     "execution_count": 80,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.NoShow.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "making sure there aren't any values missing in the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "PatientID         0\n",
       "AppointmentID     0\n",
       "Gender            0\n",
       "ScheduledDay      0\n",
       "AppointmentDay    0\n",
       "Age               0\n",
       "Neighbourhood     0\n",
       "Scholarship       0\n",
       "Hypertension      0\n",
       "Diabetes          0\n",
       "Alcoholism        0\n",
       "Handicap          0\n",
       "SMS_received      0\n",
       "NoShow            0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 81,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.isnull().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Age: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 102, 115]\n",
      "Gender: ['F' 'M']\n",
      "Diabetes: [0 1]\n",
      "Alchoholism: [0 1]\n",
      "Hypertension: [1 0]\n",
      "Handicap: [0 1 2 3 4]\n",
      "Scholarship: [0 1]\n",
      "SMS_received: [0 1]\n"
     ]
    }
   ],
   "source": [
    "print('Age:',sorted(data.Age.unique()))\n",
    "print('Gender:',data.Gender.unique())\n",
    "#print('DayOfTheWeek:',data.DayOfTheWeek.unique())\n",
    "#print('Status:',data.Status.unique())\n",
    "print('Diabetes:',data.Diabetes.unique())\n",
    "print('Alchoholism:',data.Alcoholism.unique())\n",
    "print('Hypertension:',data.Hypertension.unique())\n",
    "print('Handicap:',data.Handicap.unique())\n",
    "#print('Smokes:',data.Smokes.unique())\n",
    "print('Scholarship:',data.Scholarship.unique())\n",
    "#print('Tuberculosis:',data.Tuberculosis.unique())\n",
    "print('SMS_received:',data.SMS_received.unique())\n",
    "#print('AwaitingTime:',sorted(data.AwaitingTime.unique()))\n",
    "#print('HourOfTheDay:', sorted(data.HourOfTheDay.unique()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PatientID</th>\n",
       "      <th>AppointmentID</th>\n",
       "      <th>Gender</th>\n",
       "      <th>ScheduledDay</th>\n",
       "      <th>AppointmentDay</th>\n",
       "      <th>Age</th>\n",
       "      <th>Neighbourhood</th>\n",
       "      <th>Scholarship</th>\n",
       "      <th>Hypertension</th>\n",
       "      <th>Diabetes</th>\n",
       "      <th>Alcoholism</th>\n",
       "      <th>Handicap</th>\n",
       "      <th>SMS_received</th>\n",
       "      <th>NoShow</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2.987250e+13</td>\n",
       "      <td>5642903</td>\n",
       "      <td>F</td>\n",
       "      <td>2016-04-29T18:38:08Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>62</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5.589978e+14</td>\n",
       "      <td>5642503</td>\n",
       "      <td>M</td>\n",
       "      <td>2016-04-29T16:08:27Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>56</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.262962e+12</td>\n",
       "      <td>5642549</td>\n",
       "      <td>F</td>\n",
       "      <td>2016-04-29T16:19:04Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>62</td>\n",
       "      <td>MATA DA PRAIA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>8.679512e+11</td>\n",
       "      <td>5642828</td>\n",
       "      <td>F</td>\n",
       "      <td>2016-04-29T17:29:31Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>8</td>\n",
       "      <td>PONTAL DE CAMBURI</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>8.841186e+12</td>\n",
       "      <td>5642494</td>\n",
       "      <td>F</td>\n",
       "      <td>2016-04-29T16:07:23Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>56</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      PatientID  AppointmentID Gender          ScheduledDay  \\\n",
       "0  2.987250e+13        5642903      F  2016-04-29T18:38:08Z   \n",
       "1  5.589978e+14        5642503      M  2016-04-29T16:08:27Z   \n",
       "2  4.262962e+12        5642549      F  2016-04-29T16:19:04Z   \n",
       "3  8.679512e+11        5642828      F  2016-04-29T17:29:31Z   \n",
       "4  8.841186e+12        5642494      F  2016-04-29T16:07:23Z   \n",
       "\n",
       "         AppointmentDay  Age      Neighbourhood  Scholarship  Hypertension  \\\n",
       "0  2016-04-29T00:00:00Z   62    JARDIM DA PENHA            0             1   \n",
       "1  2016-04-29T00:00:00Z   56    JARDIM DA PENHA            0             0   \n",
       "2  2016-04-29T00:00:00Z   62      MATA DA PRAIA            0             0   \n",
       "3  2016-04-29T00:00:00Z    8  PONTAL DE CAMBURI            0             0   \n",
       "4  2016-04-29T00:00:00Z   56    JARDIM DA PENHA            0             1   \n",
       "\n",
       "   Diabetes  Alcoholism  Handicap  SMS_received NoShow  \n",
       "0         0           0         0             0     No  \n",
       "1         0           0         0             0     No  \n",
       "2         0           0         0             0     No  \n",
       "3         0           0         0             0     No  \n",
       "4         1           0         0             0     No  "
      ]
     },
     "execution_count": 83,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature Selection and Creation <a name='featureselectionandcreations' />"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PatientID</th>\n",
       "      <th>AppointmentID</th>\n",
       "      <th>Gender</th>\n",
       "      <th>ScheduledDay</th>\n",
       "      <th>AppointmentDay</th>\n",
       "      <th>Age</th>\n",
       "      <th>Neighbourhood</th>\n",
       "      <th>Scholarship</th>\n",
       "      <th>Hypertension</th>\n",
       "      <th>Diabetes</th>\n",
       "      <th>Alcoholism</th>\n",
       "      <th>Handicap</th>\n",
       "      <th>SMS_received</th>\n",
       "      <th>NoShow</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2.987250e+13</td>\n",
       "      <td>5642903</td>\n",
       "      <td>F</td>\n",
       "      <td>2016-04-29T18:38:08Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>62</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5.589978e+14</td>\n",
       "      <td>5642503</td>\n",
       "      <td>M</td>\n",
       "      <td>2016-04-29T16:08:27Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>56</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.262962e+12</td>\n",
       "      <td>5642549</td>\n",
       "      <td>F</td>\n",
       "      <td>2016-04-29T16:19:04Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>62</td>\n",
       "      <td>MATA DA PRAIA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>8.679512e+11</td>\n",
       "      <td>5642828</td>\n",
       "      <td>F</td>\n",
       "      <td>2016-04-29T17:29:31Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>8</td>\n",
       "      <td>PONTAL DE CAMBURI</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>8.841186e+12</td>\n",
       "      <td>5642494</td>\n",
       "      <td>F</td>\n",
       "      <td>2016-04-29T16:07:23Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>56</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      PatientID  AppointmentID Gender          ScheduledDay  \\\n",
       "0  2.987250e+13        5642903      F  2016-04-29T18:38:08Z   \n",
       "1  5.589978e+14        5642503      M  2016-04-29T16:08:27Z   \n",
       "2  4.262962e+12        5642549      F  2016-04-29T16:19:04Z   \n",
       "3  8.679512e+11        5642828      F  2016-04-29T17:29:31Z   \n",
       "4  8.841186e+12        5642494      F  2016-04-29T16:07:23Z   \n",
       "\n",
       "         AppointmentDay  Age      Neighbourhood  Scholarship  Hypertension  \\\n",
       "0  2016-04-29T00:00:00Z   62    JARDIM DA PENHA            0             1   \n",
       "1  2016-04-29T00:00:00Z   56    JARDIM DA PENHA            0             0   \n",
       "2  2016-04-29T00:00:00Z   62      MATA DA PRAIA            0             0   \n",
       "3  2016-04-29T00:00:00Z    8  PONTAL DE CAMBURI            0             0   \n",
       "4  2016-04-29T00:00:00Z   56    JARDIM DA PENHA            0             1   \n",
       "\n",
       "   Diabetes  Alcoholism  Handicap  SMS_received NoShow  \n",
       "0         0           0         0             0     No  \n",
       "1         0           0         0             0     No  \n",
       "2         0           0         0             0     No  \n",
       "3         0           0         0             0     No  \n",
       "4         1           0         0             0     No  "
      ]
     },
     "execution_count": 84,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.drop(data.columns[[0]], axis=1)\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Creating two binary columns for the gender"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PatientID</th>\n",
       "      <th>AppointmentID</th>\n",
       "      <th>Gender</th>\n",
       "      <th>ScheduledDay</th>\n",
       "      <th>AppointmentDay</th>\n",
       "      <th>Age</th>\n",
       "      <th>Neighbourhood</th>\n",
       "      <th>Scholarship</th>\n",
       "      <th>Hypertension</th>\n",
       "      <th>Diabetes</th>\n",
       "      <th>Alcoholism</th>\n",
       "      <th>Handicap</th>\n",
       "      <th>SMS_received</th>\n",
       "      <th>NoShow</th>\n",
       "      <th>Male</th>\n",
       "      <th>Female</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2.987250e+13</td>\n",
       "      <td>5642903</td>\n",
       "      <td>F</td>\n",
       "      <td>2016-04-29T18:38:08Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>62</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5.589978e+14</td>\n",
       "      <td>5642503</td>\n",
       "      <td>M</td>\n",
       "      <td>2016-04-29T16:08:27Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>56</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.262962e+12</td>\n",
       "      <td>5642549</td>\n",
       "      <td>F</td>\n",
       "      <td>2016-04-29T16:19:04Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>62</td>\n",
       "      <td>MATA DA PRAIA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>8.679512e+11</td>\n",
       "      <td>5642828</td>\n",
       "      <td>F</td>\n",
       "      <td>2016-04-29T17:29:31Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>8</td>\n",
       "      <td>PONTAL DE CAMBURI</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>8.841186e+12</td>\n",
       "      <td>5642494</td>\n",
       "      <td>F</td>\n",
       "      <td>2016-04-29T16:07:23Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>56</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      PatientID  AppointmentID Gender          ScheduledDay  \\\n",
       "0  2.987250e+13        5642903      F  2016-04-29T18:38:08Z   \n",
       "1  5.589978e+14        5642503      M  2016-04-29T16:08:27Z   \n",
       "2  4.262962e+12        5642549      F  2016-04-29T16:19:04Z   \n",
       "3  8.679512e+11        5642828      F  2016-04-29T17:29:31Z   \n",
       "4  8.841186e+12        5642494      F  2016-04-29T16:07:23Z   \n",
       "\n",
       "         AppointmentDay  Age      Neighbourhood  Scholarship  Hypertension  \\\n",
       "0  2016-04-29T00:00:00Z   62    JARDIM DA PENHA            0             1   \n",
       "1  2016-04-29T00:00:00Z   56    JARDIM DA PENHA            0             0   \n",
       "2  2016-04-29T00:00:00Z   62      MATA DA PRAIA            0             0   \n",
       "3  2016-04-29T00:00:00Z    8  PONTAL DE CAMBURI            0             0   \n",
       "4  2016-04-29T00:00:00Z   56    JARDIM DA PENHA            0             1   \n",
       "\n",
       "   Diabetes  Alcoholism  Handicap  SMS_received NoShow  Male  Female  \n",
       "0         0           0         0             0     No     0       1  \n",
       "1         0           0         0             0     No     1       0  \n",
       "2         0           0         0             0     No     0       1  \n",
       "3         0           0         0             0     No     0       1  \n",
       "4         1           0         0             0     No     0       1  "
      ]
     },
     "execution_count": 85,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data['Male'] = data['Gender'].replace(['F','M'], [0,1])\n",
    "data['Female'] = data['Gender'].replace(['F','M'], [1,0])\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "changing NoShow column to be binary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PatientID</th>\n",
       "      <th>AppointmentID</th>\n",
       "      <th>Gender</th>\n",
       "      <th>ScheduledDay</th>\n",
       "      <th>AppointmentDay</th>\n",
       "      <th>Age</th>\n",
       "      <th>Neighbourhood</th>\n",
       "      <th>Scholarship</th>\n",
       "      <th>Hypertension</th>\n",
       "      <th>Diabetes</th>\n",
       "      <th>Alcoholism</th>\n",
       "      <th>Handicap</th>\n",
       "      <th>SMS_received</th>\n",
       "      <th>NoShow</th>\n",
       "      <th>Male</th>\n",
       "      <th>Female</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2.987250e+13</td>\n",
       "      <td>5642903</td>\n",
       "      <td>F</td>\n",
       "      <td>2016-04-29T18:38:08Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>62</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5.589978e+14</td>\n",
       "      <td>5642503</td>\n",
       "      <td>M</td>\n",
       "      <td>2016-04-29T16:08:27Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>56</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.262962e+12</td>\n",
       "      <td>5642549</td>\n",
       "      <td>F</td>\n",
       "      <td>2016-04-29T16:19:04Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>62</td>\n",
       "      <td>MATA DA PRAIA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>8.679512e+11</td>\n",
       "      <td>5642828</td>\n",
       "      <td>F</td>\n",
       "      <td>2016-04-29T17:29:31Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>8</td>\n",
       "      <td>PONTAL DE CAMBURI</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>8.841186e+12</td>\n",
       "      <td>5642494</td>\n",
       "      <td>F</td>\n",
       "      <td>2016-04-29T16:07:23Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>56</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      PatientID  AppointmentID Gender          ScheduledDay  \\\n",
       "0  2.987250e+13        5642903      F  2016-04-29T18:38:08Z   \n",
       "1  5.589978e+14        5642503      M  2016-04-29T16:08:27Z   \n",
       "2  4.262962e+12        5642549      F  2016-04-29T16:19:04Z   \n",
       "3  8.679512e+11        5642828      F  2016-04-29T17:29:31Z   \n",
       "4  8.841186e+12        5642494      F  2016-04-29T16:07:23Z   \n",
       "\n",
       "         AppointmentDay  Age      Neighbourhood  Scholarship  Hypertension  \\\n",
       "0  2016-04-29T00:00:00Z   62    JARDIM DA PENHA            0             1   \n",
       "1  2016-04-29T00:00:00Z   56    JARDIM DA PENHA            0             0   \n",
       "2  2016-04-29T00:00:00Z   62      MATA DA PRAIA            0             0   \n",
       "3  2016-04-29T00:00:00Z    8  PONTAL DE CAMBURI            0             0   \n",
       "4  2016-04-29T00:00:00Z   56    JARDIM DA PENHA            0             1   \n",
       "\n",
       "   Diabetes  Alcoholism  Handicap  SMS_received  NoShow  Male  Female  \n",
       "0         0           0         0             0       0     0       1  \n",
       "1         0           0         0             0       0     1       0  \n",
       "2         0           0         0             0       0     0       1  \n",
       "3         0           0         0             0       0     0       1  \n",
       "4         1           0         0             0       0     0       1  "
      ]
     },
     "execution_count": 86,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data['NoShow'] = data['NoShow'].replace(['Yes','No'], [1,0])\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "dropping uneeded columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PatientID</th>\n",
       "      <th>ScheduledDay</th>\n",
       "      <th>AppointmentDay</th>\n",
       "      <th>Age</th>\n",
       "      <th>Neighbourhood</th>\n",
       "      <th>Scholarship</th>\n",
       "      <th>Hypertension</th>\n",
       "      <th>Diabetes</th>\n",
       "      <th>Alcoholism</th>\n",
       "      <th>Handicap</th>\n",
       "      <th>SMS_received</th>\n",
       "      <th>NoShow</th>\n",
       "      <th>Male</th>\n",
       "      <th>Female</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2.987250e+13</td>\n",
       "      <td>2016-04-29T18:38:08Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>62</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5.589978e+14</td>\n",
       "      <td>2016-04-29T16:08:27Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>56</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.262962e+12</td>\n",
       "      <td>2016-04-29T16:19:04Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>62</td>\n",
       "      <td>MATA DA PRAIA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>8.679512e+11</td>\n",
       "      <td>2016-04-29T17:29:31Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>8</td>\n",
       "      <td>PONTAL DE CAMBURI</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>8.841186e+12</td>\n",
       "      <td>2016-04-29T16:07:23Z</td>\n",
       "      <td>2016-04-29T00:00:00Z</td>\n",
       "      <td>56</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      PatientID          ScheduledDay        AppointmentDay  Age  \\\n",
       "0  2.987250e+13  2016-04-29T18:38:08Z  2016-04-29T00:00:00Z   62   \n",
       "1  5.589978e+14  2016-04-29T16:08:27Z  2016-04-29T00:00:00Z   56   \n",
       "2  4.262962e+12  2016-04-29T16:19:04Z  2016-04-29T00:00:00Z   62   \n",
       "3  8.679512e+11  2016-04-29T17:29:31Z  2016-04-29T00:00:00Z    8   \n",
       "4  8.841186e+12  2016-04-29T16:07:23Z  2016-04-29T00:00:00Z   56   \n",
       "\n",
       "       Neighbourhood  Scholarship  Hypertension  Diabetes  Alcoholism  \\\n",
       "0    JARDIM DA PENHA            0             1         0           0   \n",
       "1    JARDIM DA PENHA            0             0         0           0   \n",
       "2      MATA DA PRAIA            0             0         0           0   \n",
       "3  PONTAL DE CAMBURI            0             0         0           0   \n",
       "4    JARDIM DA PENHA            0             1         1           0   \n",
       "\n",
       "   Handicap  SMS_received  NoShow  Male  Female  \n",
       "0         0             0       0     0       1  \n",
       "1         0             0       0     1       0  \n",
       "2         0             0       0     0       1  \n",
       "3         0             0       0     0       1  \n",
       "4         0             0       0     0       1  "
      ]
     },
     "execution_count": 87,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = data[['PatientID','ScheduledDay','AppointmentDay', 'Age','Neighbourhood', 'Scholarship','Hypertension','Diabetes','Alcoholism','Handicap','SMS_received','NoShow','Male','Female']]\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "cleaning up dates"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PatientID</th>\n",
       "      <th>ScheduledDay</th>\n",
       "      <th>AppointmentDay</th>\n",
       "      <th>Age</th>\n",
       "      <th>Neighbourhood</th>\n",
       "      <th>Scholarship</th>\n",
       "      <th>Hypertension</th>\n",
       "      <th>Diabetes</th>\n",
       "      <th>Alcoholism</th>\n",
       "      <th>Handicap</th>\n",
       "      <th>SMS_received</th>\n",
       "      <th>NoShow</th>\n",
       "      <th>Male</th>\n",
       "      <th>Female</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2.987250e+13</td>\n",
       "      <td>2016-04-29 18:38:08</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>62</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5.589978e+14</td>\n",
       "      <td>2016-04-29 16:08:27</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>56</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.262962e+12</td>\n",
       "      <td>2016-04-29 16:19:04</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>62</td>\n",
       "      <td>MATA DA PRAIA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>8.679512e+11</td>\n",
       "      <td>2016-04-29 17:29:31</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>8</td>\n",
       "      <td>PONTAL DE CAMBURI</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>8.841186e+12</td>\n",
       "      <td>2016-04-29 16:07:23</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>56</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      PatientID        ScheduledDay AppointmentDay  Age      Neighbourhood  \\\n",
       "0  2.987250e+13 2016-04-29 18:38:08     2016-04-29   62    JARDIM DA PENHA   \n",
       "1  5.589978e+14 2016-04-29 16:08:27     2016-04-29   56    JARDIM DA PENHA   \n",
       "2  4.262962e+12 2016-04-29 16:19:04     2016-04-29   62      MATA DA PRAIA   \n",
       "3  8.679512e+11 2016-04-29 17:29:31     2016-04-29    8  PONTAL DE CAMBURI   \n",
       "4  8.841186e+12 2016-04-29 16:07:23     2016-04-29   56    JARDIM DA PENHA   \n",
       "\n",
       "   Scholarship  Hypertension  Diabetes  Alcoholism  Handicap  SMS_received  \\\n",
       "0            0             1         0           0         0             0   \n",
       "1            0             0         0           0         0             0   \n",
       "2            0             0         0           0         0             0   \n",
       "3            0             0         0           0         0             0   \n",
       "4            0             1         1           0         0             0   \n",
       "\n",
       "   NoShow  Male  Female  \n",
       "0       0     0       1  \n",
       "1       0     1       0  \n",
       "2       0     0       1  \n",
       "3       0     0       1  \n",
       "4       0     0       1  "
      ]
     },
     "execution_count": 88,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data['ScheduledDay'] = pd.to_datetime(data['ScheduledDay'])\n",
    "data['AppointmentDay'] = pd.to_datetime(data['AppointmentDay'])\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"\\ndata['ScheduledYear'], data['ScheduledMonth'], data['ScheduleDay'] = data['ScheduledDay'].dt.year, data['ScheduledDay'].dt.month, data['ScheduledDay'].dt.day\\ndata['AppointmentYear'], data['AppointmentMonth'], data['AppointmentDayy'] = data['AppointmentDay'].dt.year, data['AppointmentDay'].dt.month, data['AppointmentDay'].dt.day\\ndata.head\\n\""
      ]
     },
     "execution_count": 89,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#original features that I was using, got rid of. \n",
    "\"\"\"\n",
    "data['ScheduledYear'], data['ScheduledMonth'], data['ScheduleDay'] = data['ScheduledDay'].dt.year, data['ScheduledDay'].dt.month, data['ScheduledDay'].dt.day\n",
    "data['AppointmentYear'], data['AppointmentMonth'], data['AppointmentDayy'] = data['AppointmentDay'].dt.year, data['AppointmentDay'].dt.month, data['AppointmentDay'].dt.day\n",
    "data.head\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Probably have to get rid of Neighbourhood column given we don't have the specefic hospital for all these NoShows. If we did we could have used distance from hospital as a feature. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Creating a feature that calculates the wait time of a particular patient from when they schedule the appointment to when they actually have the appointment. I believe this will be a really great feature to have. One would assume that the longer the wait time the more likely people are to no show for their appointments"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating Waiting Time"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PatientID</th>\n",
       "      <th>ScheduledDay</th>\n",
       "      <th>AppointmentDay</th>\n",
       "      <th>Age</th>\n",
       "      <th>Neighbourhood</th>\n",
       "      <th>Scholarship</th>\n",
       "      <th>Hypertension</th>\n",
       "      <th>Diabetes</th>\n",
       "      <th>Alcoholism</th>\n",
       "      <th>Handicap</th>\n",
       "      <th>SMS_received</th>\n",
       "      <th>NoShow</th>\n",
       "      <th>Male</th>\n",
       "      <th>Female</th>\n",
       "      <th>WaitingTime</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2.987250e+13</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>62</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0 days</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5.589978e+14</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>56</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0 days</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.262962e+12</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>62</td>\n",
       "      <td>MATA DA PRAIA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0 days</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>8.679512e+11</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>8</td>\n",
       "      <td>PONTAL DE CAMBURI</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0 days</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>8.841186e+12</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>56</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0 days</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      PatientID ScheduledDay AppointmentDay  Age      Neighbourhood  \\\n",
       "0  2.987250e+13   2016-04-29     2016-04-29   62    JARDIM DA PENHA   \n",
       "1  5.589978e+14   2016-04-29     2016-04-29   56    JARDIM DA PENHA   \n",
       "2  4.262962e+12   2016-04-29     2016-04-29   62      MATA DA PRAIA   \n",
       "3  8.679512e+11   2016-04-29     2016-04-29    8  PONTAL DE CAMBURI   \n",
       "4  8.841186e+12   2016-04-29     2016-04-29   56    JARDIM DA PENHA   \n",
       "\n",
       "   Scholarship  Hypertension  Diabetes  Alcoholism  Handicap  SMS_received  \\\n",
       "0            0             1         0           0         0             0   \n",
       "1            0             0         0           0         0             0   \n",
       "2            0             0         0           0         0             0   \n",
       "3            0             0         0           0         0             0   \n",
       "4            0             1         1           0         0             0   \n",
       "\n",
       "   NoShow  Male  Female WaitingTime  \n",
       "0       0     0       1      0 days  \n",
       "1       0     1       0      0 days  \n",
       "2       0     0       1      0 days  \n",
       "3       0     0       1      0 days  \n",
       "4       0     0       1      0 days  "
      ]
     },
     "execution_count": 90,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.ScheduledDay = pd.DatetimeIndex(data.ScheduledDay).normalize()\n",
    "data['WaitingTime'] = data['AppointmentDay'] - data['ScheduledDay']\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PatientID</th>\n",
       "      <th>ScheduledDay</th>\n",
       "      <th>AppointmentDay</th>\n",
       "      <th>Age</th>\n",
       "      <th>Neighbourhood</th>\n",
       "      <th>Scholarship</th>\n",
       "      <th>Hypertension</th>\n",
       "      <th>Diabetes</th>\n",
       "      <th>Alcoholism</th>\n",
       "      <th>Handicap</th>\n",
       "      <th>SMS_received</th>\n",
       "      <th>NoShow</th>\n",
       "      <th>Male</th>\n",
       "      <th>Female</th>\n",
       "      <th>WaitingTime</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2.987250e+13</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>62</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5.589978e+14</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>56</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.262962e+12</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>62</td>\n",
       "      <td>MATA DA PRAIA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>8.679512e+11</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>8</td>\n",
       "      <td>PONTAL DE CAMBURI</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>8.841186e+12</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>56</td>\n",
       "      <td>JARDIM DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>9.598513e+13</td>\n",
       "      <td>2016-04-27</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>76</td>\n",
       "      <td>REPÚBLICA</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>7.336882e+14</td>\n",
       "      <td>2016-04-27</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>23</td>\n",
       "      <td>GOIABEIRAS</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>3.449833e+12</td>\n",
       "      <td>2016-04-27</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>39</td>\n",
       "      <td>GOIABEIRAS</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>5.639473e+13</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>21</td>\n",
       "      <td>ANDORINHAS</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>7.812456e+13</td>\n",
       "      <td>2016-04-27</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>19</td>\n",
       "      <td>CONQUISTA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>7.345362e+14</td>\n",
       "      <td>2016-04-27</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>30</td>\n",
       "      <td>NOVA PALESTINA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>7.542951e+12</td>\n",
       "      <td>2016-04-26</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>29</td>\n",
       "      <td>NOVA PALESTINA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>5.666548e+14</td>\n",
       "      <td>2016-04-28</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>22</td>\n",
       "      <td>NOVA PALESTINA</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>9.113946e+14</td>\n",
       "      <td>2016-04-28</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>28</td>\n",
       "      <td>NOVA PALESTINA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>9.988472e+13</td>\n",
       "      <td>2016-04-28</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>54</td>\n",
       "      <td>NOVA PALESTINA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>9.994839e+10</td>\n",
       "      <td>2016-04-26</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>15</td>\n",
       "      <td>NOVA PALESTINA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>8.457439e+13</td>\n",
       "      <td>2016-04-28</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>50</td>\n",
       "      <td>NOVA PALESTINA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>1.479497e+13</td>\n",
       "      <td>2016-04-28</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>40</td>\n",
       "      <td>CONQUISTA</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>1.713538e+13</td>\n",
       "      <td>2016-04-26</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>30</td>\n",
       "      <td>NOVA PALESTINA</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>7.223289e+12</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>46</td>\n",
       "      <td>DA PENHA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       PatientID ScheduledDay AppointmentDay  Age      Neighbourhood  \\\n",
       "0   2.987250e+13   2016-04-29     2016-04-29   62    JARDIM DA PENHA   \n",
       "1   5.589978e+14   2016-04-29     2016-04-29   56    JARDIM DA PENHA   \n",
       "2   4.262962e+12   2016-04-29     2016-04-29   62      MATA DA PRAIA   \n",
       "3   8.679512e+11   2016-04-29     2016-04-29    8  PONTAL DE CAMBURI   \n",
       "4   8.841186e+12   2016-04-29     2016-04-29   56    JARDIM DA PENHA   \n",
       "5   9.598513e+13   2016-04-27     2016-04-29   76          REPÚBLICA   \n",
       "6   7.336882e+14   2016-04-27     2016-04-29   23         GOIABEIRAS   \n",
       "7   3.449833e+12   2016-04-27     2016-04-29   39         GOIABEIRAS   \n",
       "8   5.639473e+13   2016-04-29     2016-04-29   21         ANDORINHAS   \n",
       "9   7.812456e+13   2016-04-27     2016-04-29   19          CONQUISTA   \n",
       "10  7.345362e+14   2016-04-27     2016-04-29   30     NOVA PALESTINA   \n",
       "11  7.542951e+12   2016-04-26     2016-04-29   29     NOVA PALESTINA   \n",
       "12  5.666548e+14   2016-04-28     2016-04-29   22     NOVA PALESTINA   \n",
       "13  9.113946e+14   2016-04-28     2016-04-29   28     NOVA PALESTINA   \n",
       "14  9.988472e+13   2016-04-28     2016-04-29   54     NOVA PALESTINA   \n",
       "15  9.994839e+10   2016-04-26     2016-04-29   15     NOVA PALESTINA   \n",
       "16  8.457439e+13   2016-04-28     2016-04-29   50     NOVA PALESTINA   \n",
       "17  1.479497e+13   2016-04-28     2016-04-29   40          CONQUISTA   \n",
       "18  1.713538e+13   2016-04-26     2016-04-29   30     NOVA PALESTINA   \n",
       "19  7.223289e+12   2016-04-29     2016-04-29   46           DA PENHA   \n",
       "\n",
       "    Scholarship  Hypertension  Diabetes  Alcoholism  Handicap  SMS_received  \\\n",
       "0             0             1         0           0         0             0   \n",
       "1             0             0         0           0         0             0   \n",
       "2             0             0         0           0         0             0   \n",
       "3             0             0         0           0         0             0   \n",
       "4             0             1         1           0         0             0   \n",
       "5             0             1         0           0         0             0   \n",
       "6             0             0         0           0         0             0   \n",
       "7             0             0         0           0         0             0   \n",
       "8             0             0         0           0         0             0   \n",
       "9             0             0         0           0         0             0   \n",
       "10            0             0         0           0         0             0   \n",
       "11            0             0         0           0         0             1   \n",
       "12            1             0         0           0         0             0   \n",
       "13            0             0         0           0         0             0   \n",
       "14            0             0         0           0         0             0   \n",
       "15            0             0         0           0         0             1   \n",
       "16            0             0         0           0         0             0   \n",
       "17            1             0         0           0         0             0   \n",
       "18            1             0         0           0         0             1   \n",
       "19            0             0         0           0         0             0   \n",
       "\n",
       "    NoShow  Male  Female  WaitingTime  \n",
       "0        0     0       1            0  \n",
       "1        0     1       0            0  \n",
       "2        0     0       1            0  \n",
       "3        0     0       1            0  \n",
       "4        0     0       1            0  \n",
       "5        0     0       1            2  \n",
       "6        1     0       1            2  \n",
       "7        1     0       1            2  \n",
       "8        0     0       1            0  \n",
       "9        0     0       1            2  \n",
       "10       0     0       1            2  \n",
       "11       1     1       0            3  \n",
       "12       0     0       1            1  \n",
       "13       0     1       0            1  \n",
       "14       0     0       1            1  \n",
       "15       0     0       1            3  \n",
       "16       0     1       0            1  \n",
       "17       1     0       1            1  \n",
       "18       0     0       1            3  \n",
       "19       0     0       1            0  "
      ]
     },
     "execution_count": 91,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data['WaitingTime'] = data['WaitingTime'].apply(lambda x: x.days)\n",
    "data.head(20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "10.183701719941734"
      ]
     },
     "execution_count": 92,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data['WaitingTime'].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Need to make one last clean dataset picking which features I will actually use in the model. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>NoShow</th>\n",
       "      <th>ScheduledDay</th>\n",
       "      <th>AppointmentDay</th>\n",
       "      <th>Age</th>\n",
       "      <th>Scholarship</th>\n",
       "      <th>Hypertension</th>\n",
       "      <th>Diabetes</th>\n",
       "      <th>Alcoholism</th>\n",
       "      <th>Handicap</th>\n",
       "      <th>SMS_received</th>\n",
       "      <th>Male</th>\n",
       "      <th>Female</th>\n",
       "      <th>WaitingTime</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>62</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>56</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>62</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>8</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>56</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   NoShow ScheduledDay AppointmentDay  Age  Scholarship  Hypertension  \\\n",
       "0       0   2016-04-29     2016-04-29   62            0             1   \n",
       "1       0   2016-04-29     2016-04-29   56            0             0   \n",
       "2       0   2016-04-29     2016-04-29   62            0             0   \n",
       "3       0   2016-04-29     2016-04-29    8            0             0   \n",
       "4       0   2016-04-29     2016-04-29   56            0             1   \n",
       "\n",
       "   Diabetes  Alcoholism  Handicap  SMS_received  Male  Female  WaitingTime  \n",
       "0         0           0         0             0     0       1            0  \n",
       "1         0           0         0             0     1       0            0  \n",
       "2         0           0         0             0     0       1            0  \n",
       "3         0           0         0             0     0       1            0  \n",
       "4         1           0         0             0     0       1            0  "
      ]
     },
     "execution_count": 93,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = data[['NoShow','ScheduledDay','AppointmentDay','Age','Scholarship','Hypertension','Diabetes','Alcoholism','Handicap','SMS_received','Male','Female','WaitingTime']]\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train Models <a name='trainmodels' />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is probably the most interesting and complex part to this analysis. \n",
    "\n",
    "I will run a bunch of different models to see what their various testing/training accuracies are in order to pick the best one and then try and optomize that particular model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>NoShow</th>\n",
       "      <th>ScheduledDay</th>\n",
       "      <th>AppointmentDay</th>\n",
       "      <th>Age</th>\n",
       "      <th>Scholarship</th>\n",
       "      <th>Hypertension</th>\n",
       "      <th>Diabetes</th>\n",
       "      <th>Alcoholism</th>\n",
       "      <th>Handicap</th>\n",
       "      <th>SMS_received</th>\n",
       "      <th>Male</th>\n",
       "      <th>Female</th>\n",
       "      <th>WaitingTime</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>62</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>56</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>62</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>8</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>56</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   NoShow ScheduledDay AppointmentDay  Age  Scholarship  Hypertension  \\\n",
       "0       0   2016-04-29     2016-04-29   62            0             1   \n",
       "1       0   2016-04-29     2016-04-29   56            0             0   \n",
       "2       0   2016-04-29     2016-04-29   62            0             0   \n",
       "3       0   2016-04-29     2016-04-29    8            0             0   \n",
       "4       0   2016-04-29     2016-04-29   56            0             1   \n",
       "\n",
       "   Diabetes  Alcoholism  Handicap  SMS_received  Male  Female  WaitingTime  \n",
       "0         0           0         0             0     0       1            0  \n",
       "1         0           0         0             0     1       0            0  \n",
       "2         0           0         0             0     0       1            0  \n",
       "3         0           0         0             0     0       1            0  \n",
       "4         1           0         0             0     0       1            0  "
      ]
     },
     "execution_count": 94,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Age</th>\n",
       "      <th>Scholarship</th>\n",
       "      <th>Hypertension</th>\n",
       "      <th>Diabetes</th>\n",
       "      <th>Alcoholism</th>\n",
       "      <th>Handicap</th>\n",
       "      <th>SMS_received</th>\n",
       "      <th>Male</th>\n",
       "      <th>Female</th>\n",
       "      <th>WaitingTime</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>62</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>56</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>62</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>8</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>56</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Age  Scholarship  Hypertension  Diabetes  Alcoholism  Handicap  \\\n",
       "0   62            0             1         0           0         0   \n",
       "1   56            0             0         0           0         0   \n",
       "2   62            0             0         0           0         0   \n",
       "3    8            0             0         0           0         0   \n",
       "4   56            0             1         1           0         0   \n",
       "\n",
       "   SMS_received  Male  Female  WaitingTime  \n",
       "0             0     0       1            0  \n",
       "1             0     1       0            0  \n",
       "2             0     0       1            0  \n",
       "3             0     0       1            0  \n",
       "4             0     0       1            0  "
      ]
     },
     "execution_count": 95,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#the features we are going to train our model on\n",
    "showfeatures = data.iloc[:,3:19]\n",
    "showfeatures.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Splitting"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.cross_validation import train_test_split, cross_val_score\n",
    "from sklearn.metrics import accuracy_score\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[62,  0,  1, ...,  0,  1,  0],\n",
       "       [56,  0,  0, ...,  1,  0,  0],\n",
       "       [62,  0,  0, ...,  0,  1,  0],\n",
       "       ..., \n",
       "       [21,  0,  0, ...,  0,  1, 41],\n",
       "       [38,  0,  0, ...,  0,  1, 41],\n",
       "       [54,  0,  0, ...,  0,  1, 41]])"
      ]
     },
     "execution_count": 97,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# TRAIN/TEST PARTITION\n",
    "#create the feature vectors and class labels\n",
    "\n",
    "\n",
    "features = np.array(showfeatures)\n",
    "labels = np.array(data['NoShow'])\n",
    "\n",
    "features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    88208\n",
       "1    22319\n",
       "Name: NoShow, dtype: int64"
      ]
     },
     "execution_count": 98,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data['NoShow'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#split the data into training and testing sets(67% training, 33% into testing)\n",
    "training_features, testing_features, training_labels, testing_labels = train_test_split(features,labels, test_size = 0.2,random_state = 42 )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Logistic regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
       "          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n",
       "          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n",
       "          verbose=0, warm_start=False)"
      ]
     },
     "execution_count": 100,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn import linear_model\n",
    "\n",
    "lg = linear_model.LogisticRegression()\n",
    "lg.fit(training_features,training_labels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ 0.7955445   0.79463983  0.79359873  0.79246777  0.79552138  0.79518209\n",
      "  0.79552138  0.79371183  0.79416422  0.79199186]\n",
      "[0 0 0 ..., 0 0 0]\n",
      "0.795304442233\n"
     ]
    }
   ],
   "source": [
    "# hold out cross validation\n",
    "\n",
    "#print(accuracy_score(testing_labels, predictions))\n",
    "\n",
    "score = cross_val_score(lg, training_features,training_labels, cv = 10, scoring= 'accuracy')\n",
    "print(score)\n",
    "\n",
    "predictions = lg.predict(testing_features)\n",
    "print(predictions)\n",
    "\n",
    "score = accuracy_score(testing_labels, predictions)\n",
    "print(score)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Multi Layer Perceptron just for fun, takes forever to run"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'from sklearn.neural_network import MLPClassifier\\nmlp = MLPClassifier(alpha=0.01, hidden_layer_sizes = (100,100))\\nmlp.fit(training_features,training_labels)'"
      ]
     },
     "execution_count": 102,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\"\"\"from sklearn.neural_network import MLPClassifier\n",
    "mlp = MLPClassifier(alpha=0.01, hidden_layer_sizes = (100,100))\n",
    "mlp.fit(training_features,training_labels)\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"# hold out cross validation\\n\\n#print(accuracy_score(testing_labels, predictions))\\n\\nscore = cross_val_score(mlp, training_features,training_labels, cv = 10, scoring= 'accuracy')\\nprint(score)\\n\\npredictions = mlp.predict(testing_features)\\nprint(predictions)\\n\\nscore = accuracy_score(testing_labels, predictions)\\nprint(score)\""
      ]
     },
     "execution_count": 103,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\"\"\"# hold out cross validation\n",
    "\n",
    "#print(accuracy_score(testing_labels, predictions))\n",
    "\n",
    "score = cross_val_score(mlp, training_features,training_labels, cv = 10, scoring= 'accuracy')\n",
    "print(score)\n",
    "\n",
    "predictions = mlp.predict(testing_features)\n",
    "print(predictions)\n",
    "\n",
    "score = accuracy_score(testing_labels, predictions)\n",
    "print(score)\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Random Forrest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n",
       "            max_depth=2, max_features='auto', max_leaf_nodes=None,\n",
       "            min_impurity_decrease=0.0, min_impurity_split=None,\n",
       "            min_samples_leaf=1, min_samples_split=2,\n",
       "            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,\n",
       "            oob_score=False, random_state=None, verbose=0,\n",
       "            warm_start=False)"
      ]
     },
     "execution_count": 104,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.ensemble import RandomForestClassifier\n",
    "\n",
    "rf = RandomForestClassifier(max_depth=2,n_estimators=100)\n",
    "rf.fit(training_features,training_labels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ 0.79769309  0.79769309  0.79778331  0.79778331  0.79778331  0.79778331\n",
      "  0.79778331  0.79778331  0.79778331  0.79776043]\n",
      "[0 0 0 ..., 0 0 0]\n",
      "0.79928526192\n"
     ]
    }
   ],
   "source": [
    "# hold out cross validation\n",
    "\n",
    "#print(accuracy_score(testing_labels, predictions))\n",
    "\n",
    "score = cross_val_score(rf, training_features,training_labels, cv = 10, scoring= 'accuracy')\n",
    "print(score)\n",
    "\n",
    "predictions = rf.predict(testing_features)\n",
    "print(predictions)\n",
    "\n",
    "score = accuracy_score(testing_labels, predictions)\n",
    "print(score)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ## KNeighbors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.neighbors import KNeighborsClassifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 107,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#TRAIN/TEST ALGORITHM\n",
    "#instance the model\n",
    "\n",
    "kList = range(1,50)\n",
    "\n",
    "cv_scores = []\n",
    "\n",
    "neighbors = filter(lambda x: x % 2 != 0, kList)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 108,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\n",
      "3\n",
      "5\n",
      "7\n",
      "9\n",
      "11\n",
      "13\n",
      "15\n",
      "17\n",
      "19\n",
      "21\n",
      "23\n",
      "25\n",
      "27\n",
      "29\n",
      "31\n",
      "33\n",
      "35\n",
      "37\n",
      "39\n",
      "41\n",
      "43\n",
      "45\n",
      "47\n",
      "49\n",
      "done finding\n"
     ]
    }
   ],
   "source": [
    "\n",
    "for i in neighbors:\n",
    "    print(i)\n",
    "    knn = KNeighborsClassifier(n_neighbors=i)\n",
    "\n",
    "    knn.fit(training_features,training_labels)\n",
    "\n",
    "    #test the model\n",
    "    predictions = knn.predict(testing_features)\n",
    "\n",
    "    # hold out cross validation\n",
    "\n",
    "    #print(accuracy_score(testing_labels, predictions))\n",
    "\n",
    "\n",
    "    scores = cross_val_score(knn, training_features,training_labels, cv = 10, scoring= 'accuracy')\n",
    "    #print(scores)\n",
    "    cv_scores.append(scores.mean())\n",
    "\n",
    "print(\"done finding\") "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 109,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "optimalk = cv_scores.index(max(cv_scores))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 110,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
       "           metric_params=None, n_jobs=1, n_neighbors=24, p=2,\n",
       "           weights='uniform')"
      ]
     },
     "execution_count": 110,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "knn = KNeighborsClassifier(n_neighbors=optimalk)\n",
    "knn.fit(training_features,training_labels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 111,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.795711571519\n"
     ]
    }
   ],
   "source": [
    "\n",
    "predictions = knn.predict(testing_features)\n",
    "#data['predictions'] = knn.predict(testing_features)\n",
    "\n",
    "\n",
    "score = accuracy_score(testing_labels, predictions)\n",
    "print(score)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Predict Whether a Patient Will Show Up: User Input"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 112,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>NoShow</th>\n",
       "      <th>ScheduledDay</th>\n",
       "      <th>AppointmentDay</th>\n",
       "      <th>Age</th>\n",
       "      <th>Scholarship</th>\n",
       "      <th>Hypertension</th>\n",
       "      <th>Diabetes</th>\n",
       "      <th>Alcoholism</th>\n",
       "      <th>Handicap</th>\n",
       "      <th>SMS_received</th>\n",
       "      <th>Male</th>\n",
       "      <th>Female</th>\n",
       "      <th>WaitingTime</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>62</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>56</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>62</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>8</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>2016-04-29</td>\n",
       "      <td>56</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   NoShow ScheduledDay AppointmentDay  Age  Scholarship  Hypertension  \\\n",
       "0       0   2016-04-29     2016-04-29   62            0             1   \n",
       "1       0   2016-04-29     2016-04-29   56            0             0   \n",
       "2       0   2016-04-29     2016-04-29   62            0             0   \n",
       "3       0   2016-04-29     2016-04-29    8            0             0   \n",
       "4       0   2016-04-29     2016-04-29   56            0             1   \n",
       "\n",
       "   Diabetes  Alcoholism  Handicap  SMS_received  Male  Female  WaitingTime  \n",
       "0         0           0         0             0     0       1            0  \n",
       "1         0           0         0             0     1       0            0  \n",
       "2         0           0         0             0     0       1            0  \n",
       "3         0           0         0             0     0       1            0  \n",
       "4         1           0         0             0     0       1            0  "
      ]
     },
     "execution_count": 112,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 113,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Enter the patient's age: \n",
      "5\n",
      "Is the patient on Scholarship(Yes or No): \n",
      "Yes\n",
      "Does the patient have Hypertension?(Yes or No): \n",
      "Yes\n",
      "Is the patient Diabetic?(Yes or No): \n",
      "Yes\n",
      "Is the patient an Alcoholic?(Yes or No): \n",
      "Yes\n",
      "How many Handicaps does the patient have?: \n",
      "5\n",
      "Will the patient recieve a text message?(Yes or No): \n",
      "Yes\n",
      "What is the patient gender?(Male or Female): \n",
      "Female\n",
      "How many days away is the appointment?: \n",
      "10\n",
      " \n",
      "user answer: \n",
      "[5, 1.0, 1.0, 1.0, 1.0, 5, 1.0, 1.0, 0.0, 10]\n",
      "[0]\n",
      "[0]\n",
      "[0]\n"
     ]
    }
   ],
   "source": [
    "age = 0.0\n",
    "scholarship = 0.0\n",
    "hypertension = 0.0\n",
    "diabetes = 0.0\n",
    "alcoholism = 0.0\n",
    "handicap = 0.0\n",
    "sms = 0.0\n",
    "male = 0.0\n",
    "female = 0.0\n",
    "waitingtime = 0\n",
    "\n",
    "\n",
    "\n",
    "inputage = int(input(\"Enter the patient's age: \"+ \"\\n\" ))\n",
    "inputscholarship = input(\"Is the patient on Scholarship(Yes or No): \"+ \"\\n\" )\n",
    "inputhypertension = input(\"Does the patient have Hypertension?(Yes or No): \"+ \"\\n\" )\n",
    "inputdiabetes = input(\"Is the patient Diabetic?(Yes or No): \"+ \"\\n\" )\n",
    "inputalcoholism = input(\"Is the patient an Alcoholic?(Yes or No): \"+ \"\\n\" )\n",
    "inputhandicap = int(input(\"How many Handicaps does the patient have?: \"+ \"\\n\" ))\n",
    "inputsms = input(\"Will the patient recieve a text message?(Yes or No): \"+ \"\\n\" )\n",
    "inputgender = input(\"What is the patient gender?(Male or Female): \"+ \"\\n\")\n",
    "\n",
    "inputwaitingtime = int(input(\"How many days away is the appointment?: \"+ \"\\n\"))\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "age = inputage\n",
    "handicap = inputhandicap\n",
    "waitingtime = inputwaitingtime\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "if inputscholarship == \"Yes\" or \"yes\":\n",
    "    scholarship = 1.0\n",
    "elif inputscholarship == \"No\" or \"no\":\n",
    "    scholarship = 0\n",
    "\n",
    "if hypertension == \"Yes\" or \"yes\":\n",
    "    hypertension = 1.0\n",
    "elif hypertension == \"No\" or \"no\":\n",
    "    hypertension = 0\n",
    "if inputdiabetes == \"Yes\" or \"yes\":\n",
    "    diabetes = 1.0\n",
    "elif inputdiabetes == \"No\" or \"no\":\n",
    "    daibetes = 0\n",
    "if inputalcoholism == \"Yes\" or \"yes\":\n",
    "    alcoholism = 1.0\n",
    "elif inputalcoholism == \"No\" or \"no\":\n",
    "    alcoholism = 0\n",
    "if inputsms == \"Yes\" or \"yes\":\n",
    "    sms = 1.0\n",
    "elif inputsms == \"No\" or \"no\":\n",
    "    sms = 0\n",
    "if inputgender == \"Male\" or \"M\":\n",
    "    male = 1.0\n",
    "elif inputgender == \"Female\" or \"F\":\n",
    "    female = 1.0\n",
    "else: \n",
    "    print(\"incorrect gender input\")\n",
    "   \n",
    "\n",
    "\n",
    "answer = [age,scholarship,hypertension,diabetes,alcoholism,handicap,sms,male,female,waitingtime]\n",
    "\n",
    "#the commented out chunk was for testing to not have to type everything in everytime I wanted to test\n",
    "\"\"\"\n",
    "print(\" \")\n",
    "\n",
    "print(\"Fixed answer: \")\n",
    "\n",
    "fixedanswer = [5, 1.0, 1.0 ,1.0, 1.0, 4, 0.0, 1.0, 1.0,10]\n",
    "\n",
    "print(fixedanswer)\n",
    "\n",
    "print(rf.predict([fixedanswer]))\n",
    "print(lg.predict([fixedanswer]))\n",
    "print(knn.predict([fixedanswer]))\n",
    "\"\"\"\n",
    "\n",
    "print(\" \")\n",
    "\n",
    "print(\"user answer: \")\n",
    "\n",
    "\n",
    "print(answer)\n",
    "\n",
    "print(rf.predict([answer]))\n",
    "print(lg.predict([answer]))\n",
    "print(knn.predict([answer]))\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Esemble Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 114,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Here we will use an ensemble of our three different models to vote whether this patient will show up or not.\n",
      "Let's see how our individual models voted.\n",
      "\n",
      "random forrest:\n",
      "0\n",
      "show\n",
      "\n",
      "logistic regression:\n",
      "0\n",
      "show\n",
      "\n",
      "knn:\n",
      "0\n",
      "show\n",
      "\n",
      "The final vote is 0.000000 no shows to 3 shows\n"
     ]
    }
   ],
   "source": [
    "print(\"Here we will use an ensemble of our three different models to vote whether this patient will show up or not.\")\n",
    "print(\"Let's see how our individual models voted.\")\n",
    "noshowcounter = 0.0\n",
    "showcounter = 0.0\n",
    "\n",
    "randomresult = (rf.predict([answer]))\n",
    "logresult = (lg.predict([answer]))\n",
    "knnresult = (knn.predict([answer]))\n",
    "\n",
    "print(\"\")\n",
    "print(\"random forrest:\")\n",
    "print(randomresult[0])\n",
    "\n",
    "if randomresult[0] == 1:\n",
    "    noshowcounter+=1\n",
    "    print(\"no show\")\n",
    "else:\n",
    "    showcounter+=1\n",
    "    print(\"show\")\n",
    "\n",
    "print(\"\")\n",
    "print(\"logistic regression:\")\n",
    "print(logresult[0])\n",
    "\n",
    "if logresult[0] == 1:\n",
    "    noshowcounter+=1\n",
    "    print(\"no show\")\n",
    "else:\n",
    "    showcounter+=1\n",
    "    print(\"show\")\n",
    "\n",
    "    \n",
    "print(\"\")\n",
    "print(\"knn:\")\n",
    "print(knnresult[0])\n",
    "\n",
    "if knnresult[0] == 1:\n",
    "    noshowcounter+=1\n",
    "    print(\"no show\")\n",
    "else:\n",
    "    showcounter+=1\n",
    "    print(\"show\")\n",
    "\n",
    "\n",
    "    \n",
    "print(\"\")   \n",
    "print(\"The final vote is %f no shows to %d shows\" %(noshowcounter,showcounter))\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Analysis of Ensemble Model:\n",
    "What you can see here is that there needs to be more descrepencies or more features because an esenmble doesn't really help here because they all vote the same"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 115,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Our Model Predicts that this patient will show up to their appointment, intervention is not needed\n"
     ]
    }
   ],
   "source": [
    "\n",
    "if noshowcounter >= 2:\n",
    "    print(\"Our Esemble Model predicts the patient is not going to show up for their appointment.\")\n",
    "    print(\"Be advised, intervention may be neccesary or suggested in order for the patient to show up \")\n",
    "\n",
    "else:\n",
    "    print(\"Our Model Predicts that this patient will show up to their appointment, intervention is not needed\")\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 116,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0 0 0 ..., 0 0 0]\n",
      "[0 0 0 ..., 0 0 0]\n",
      "[0 0 0 ..., 0 0 0]\n",
      "\n",
      "\n",
      "done with for loop\n"
     ]
    }
   ],
   "source": [
    "rfpredictions = rf.predict(testing_features)\n",
    "lgpredictions = lg.predict(testing_features)\n",
    "knnpredictions = knn.predict(testing_features)\n",
    "\n",
    "print(rfpredictions)\n",
    "print(lgpredictions)\n",
    "print(knnpredictions)\n",
    "print(\"\")\n",
    "\n",
    "lengthoflists = (len(rfpredictions))\n",
    "\n",
    "ensemblepredictions = []\n",
    "\n",
    "ensemblescore = 0\n",
    "\n",
    "for i in range(len(rfpredictions)):\n",
    "    \n",
    "    if rfpredictions[i] == 1:\n",
    "        ensemblescore+=1\n",
    "    if lgpredictions[i] == 1:\n",
    "        ensemblescore+=1\n",
    "    if knnpredictions[i] == 1:\n",
    "        ensemblescore+=1\n",
    "        \n",
    "    if ensemblescore >= 2:\n",
    "        ensemblescore = 1\n",
    "    else:\n",
    "        ensemblescore = 0\n",
    "        \n",
    "    ensemblepredictions.append(ensemblescore)\n",
    "    \n",
    "    \n",
    "    \n",
    "    \n",
    "    \n",
    "    \n",
    "    \n",
    "print(\"\")   \n",
    "print(\"done with for loop\")\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 117,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "0.799149552158\n"
     ]
    }
   ],
   "source": [
    "print(\"\")   \n",
    "#uncomment if you want to see my ensemble predicting basically all zeros.\n",
    "#print(ensemblepredictions)\n",
    "\n",
    "ensemblescore = accuracy_score(testing_labels, ensemblepredictions)\n",
    "print(ensemblescore)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Confusion Matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 118,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[17658,    11],\n",
       "       [ 4429,     8]])"
      ]
     },
     "execution_count": 118,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.metrics import confusion_matrix\n",
    "\n",
    "confusion_matrix(testing_labels, ensemblepredictions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Error Analysis:\n",
    "I was a bit confused what was going on here and why all of the predictions were zeros but what we learned is that just because you have these features doesn't neccesarily mean you can convert them into a yes or no no show. Rather indiviudal classifiers and ensemble model learned if you just say people are always going to show up you will get an 80% accuracy\n",
    "\n",
    "That being said, you can still predict the probability at which individuals may or may not show up using predicta proba on logistic regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Predicting the probability of an individual patient not showing up"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Patient #1: 5 year old female with a lot of medical \"problems\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 119,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Enter the patient's age: \n",
      "5\n",
      "Is the patient on Scholarship(Yes or No): \n",
      "Yes\n",
      "Does the patient have Hypertension?(Yes or No): \n",
      "Yes\n",
      "Is the patient Diabetic?(Yes or No): \n",
      "Yes\n",
      "Is the patient an Alcoholic?(Yes or No): \n",
      "Yes\n",
      "How many Handicaps does the patient have?: \n",
      "5\n",
      "Will the patient recieve a text message?(Yes or No): \n",
      "Yes\n",
      "What is the patient gender?(Male or Female): \n",
      "Female\n",
      "How many days away is the appointment?: \n",
      "10\n",
      " \n",
      " \n",
      "user answer: \n",
      "[5, 1.0, 1.0, 1.0, 1.0, 5, 1.0, 1.0, 0.0, 10]\n"
     ]
    }
   ],
   "source": [
    "age = 0.0\n",
    "scholarship = 0.0\n",
    "hypertension = 0.0\n",
    "diabetes = 0.0\n",
    "alcoholism = 0.0\n",
    "handicap = 0.0\n",
    "sms = 0.0\n",
    "male = 0.0\n",
    "female = 0.0\n",
    "waitingtime = 0\n",
    "\n",
    "\n",
    "\n",
    "inputage = int(input(\"Enter the patient's age: \"+ \"\\n\" ))\n",
    "inputscholarship = input(\"Is the patient on Scholarship(Yes or No): \"+ \"\\n\" )\n",
    "inputhypertension = input(\"Does the patient have Hypertension?(Yes or No): \"+ \"\\n\" )\n",
    "inputdiabetes = input(\"Is the patient Diabetic?(Yes or No): \"+ \"\\n\" )\n",
    "inputalcoholism = input(\"Is the patient an Alcoholic?(Yes or No): \"+ \"\\n\" )\n",
    "inputhandicap = int(input(\"How many Handicaps does the patient have?: \"+ \"\\n\" ))\n",
    "inputsms = input(\"Will the patient recieve a text message?(Yes or No): \"+ \"\\n\" )\n",
    "inputgender = input(\"What is the patient gender?(Male or Female): \"+ \"\\n\")\n",
    "\n",
    "inputwaitingtime = int(input(\"How many days away is the appointment?: \"+ \"\\n\"))\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "age = inputage\n",
    "handicap = inputhandicap\n",
    "waitingtime = inputwaitingtime\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "if inputscholarship == \"Yes\" or \"yes\":\n",
    "    scholarship = 1.0\n",
    "elif inputscholarship == \"No\" or \"no\":\n",
    "    scholarship = 0\n",
    "\n",
    "if hypertension == \"Yes\" or \"yes\":\n",
    "    hypertension = 1.0\n",
    "elif hypertension == \"No\" or \"no\":\n",
    "    hypertension = 0\n",
    "if inputdiabetes == \"Yes\" or \"yes\":\n",
    "    diabetes = 1.0\n",
    "elif inputdiabetes == \"No\" or \"no\":\n",
    "    daibetes = 0\n",
    "if inputalcoholism == \"Yes\" or \"yes\":\n",
    "    alcoholism = 1.0\n",
    "elif inputalcoholism == \"No\" or \"no\":\n",
    "    alcoholism = 0\n",
    "if inputsms == \"Yes\" or \"yes\":\n",
    "    sms = 1.0\n",
    "elif inputsms == \"No\" or \"no\":\n",
    "    sms = 0\n",
    "if inputgender == \"Male\" or \"M\":\n",
    "    male = 1.0\n",
    "elif inputgender == \"Female\" or \"F\":\n",
    "    female = 1.0\n",
    "else: \n",
    "    print(\"incorrect gender input\")\n",
    "   \n",
    "\n",
    "\n",
    "answer = [age,scholarship,hypertension,diabetes,alcoholism,handicap,sms,male,female,waitingtime]\n",
    "\n",
    "#the commented out chunk was for testing to not have to type everything in everytime I wanted to test\n",
    "\n",
    "\n",
    "print(\" \")\n",
    "\"\"\"\n",
    "print(\"Fixed answer: \")\n",
    "\n",
    "fixedanswer = [100, 1.0, 1.0 ,1.0, 1.0, 5, 0.0, 1.0, 1.0,10]\n",
    "\n",
    "print(fixedanswer)\n",
    "\n",
    "\n",
    "#print(rf.predict([fixedanswer]))\n",
    "#print(lg.predict([fixedanswer]))\n",
    "#print(knn.predict([fixedanswer]))\n",
    "\"\"\"\n",
    "\n",
    "print(\" \")\n",
    "\n",
    "print(\"user answer: \")\n",
    "\n",
    "\n",
    "print(answer)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 120,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There is a 0.540198 percent chance patient #1 does show up and a 0.459802 percent chance this person doesn't show up\n"
     ]
    }
   ],
   "source": [
    "\n",
    "probabilityresult = lg.predict_proba([answer])\n",
    "\n",
    "showupproba = probabilityresult[0]\n",
    "showupproba = showupproba[0]\n",
    "\n",
    "\n",
    "noshowproba = probabilityresult[0]\n",
    "noshowproba = noshowproba[1]\n",
    "\n",
    "print(\"There is a %f percent chance patient #1 does show up and a %f percent chance this person doesn't show up\" % (showupproba,noshowproba))\n",
    "      \n",
    "      "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Patient #2: 100 year old female with the same lot of medical \"problems\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 121,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Enter the patient's age: \n",
      "100\n",
      "Is the patient on Scholarship(Yes or No): \n",
      "Yes\n",
      "Does the patient have Hypertension?(Yes or No): \n",
      "Yes\n",
      "Is the patient Diabetic?(Yes or No): \n",
      "Yes\n",
      "Is the patient an Alcoholic?(Yes or No): \n",
      "Yes\n",
      "How many Handicaps does the patient have?: \n",
      "5\n",
      "Will the patient recieve a text message?(Yes or No): \n",
      "Yes\n",
      "What is the patient gender?(Male or Female): \n",
      "Female\n",
      "How many days away is the appointment?: \n",
      "10\n",
      " \n",
      " \n",
      "user answer: \n",
      "[100, 1.0, 1.0, 1.0, 1.0, 5, 1.0, 1.0, 0.0, 10]\n"
     ]
    }
   ],
   "source": [
    "age = 0.0\n",
    "scholarship = 0.0\n",
    "hypertension = 0.0\n",
    "diabetes = 0.0\n",
    "alcoholism = 0.0\n",
    "handicap = 0.0\n",
    "sms = 0.0\n",
    "male = 0.0\n",
    "female = 0.0\n",
    "waitingtime = 0\n",
    "\n",
    "\n",
    "\n",
    "inputage = int(input(\"Enter the patient's age: \"+ \"\\n\" ))\n",
    "inputscholarship = input(\"Is the patient on Scholarship(Yes or No): \"+ \"\\n\" )\n",
    "inputhypertension = input(\"Does the patient have Hypertension?(Yes or No): \"+ \"\\n\" )\n",
    "inputdiabetes = input(\"Is the patient Diabetic?(Yes or No): \"+ \"\\n\" )\n",
    "inputalcoholism = input(\"Is the patient an Alcoholic?(Yes or No): \"+ \"\\n\" )\n",
    "inputhandicap = int(input(\"How many Handicaps does the patient have?: \"+ \"\\n\" ))\n",
    "inputsms = input(\"Will the patient recieve a text message?(Yes or No): \"+ \"\\n\" )\n",
    "inputgender = input(\"What is the patient gender?(Male or Female): \"+ \"\\n\")\n",
    "\n",
    "inputwaitingtime = int(input(\"How many days away is the appointment?: \"+ \"\\n\"))\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "age = inputage\n",
    "handicap = inputhandicap\n",
    "waitingtime = inputwaitingtime\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "if inputscholarship == \"Yes\" or \"yes\":\n",
    "    scholarship = 1.0\n",
    "elif inputscholarship == \"No\" or \"no\":\n",
    "    scholarship = 0\n",
    "\n",
    "if hypertension == \"Yes\" or \"yes\":\n",
    "    hypertension = 1.0\n",
    "elif hypertension == \"No\" or \"no\":\n",
    "    hypertension = 0\n",
    "if inputdiabetes == \"Yes\" or \"yes\":\n",
    "    diabetes = 1.0\n",
    "elif inputdiabetes == \"No\" or \"no\":\n",
    "    daibetes = 0\n",
    "if inputalcoholism == \"Yes\" or \"yes\":\n",
    "    alcoholism = 1.0\n",
    "elif inputalcoholism == \"No\" or \"no\":\n",
    "    alcoholism = 0\n",
    "if inputsms == \"Yes\" or \"yes\":\n",
    "    sms = 1.0\n",
    "elif inputsms == \"No\" or \"no\":\n",
    "    sms = 0\n",
    "if inputgender == \"Male\" or \"M\":\n",
    "    male = 1.0\n",
    "elif inputgender == \"Female\" or \"F\":\n",
    "    female = 1.0\n",
    "else: \n",
    "    print(\"incorrect gender input\")\n",
    "   \n",
    "\n",
    "\n",
    "answer = [age,scholarship,hypertension,diabetes,alcoholism,handicap,sms,male,female,waitingtime]\n",
    "\n",
    "#the commented out chunk was for testing to not have to type everything in everytime I wanted to test\n",
    "\n",
    "\n",
    "print(\" \")\n",
    "\"\"\"\n",
    "print(\"Fixed answer: \")\n",
    "\n",
    "fixedanswer = [100, 1.0, 1.0 ,1.0, 1.0, 5, 0.0, 1.0, 1.0,10]\n",
    "\n",
    "print(fixedanswer)\n",
    "\n",
    "\n",
    "#print(rf.predict([fixedanswer]))\n",
    "#print(lg.predict([fixedanswer]))\n",
    "#print(knn.predict([fixedanswer]))\n",
    "\"\"\"\n",
    "\n",
    "print(\" \")\n",
    "\n",
    "print(\"user answer: \")\n",
    "\n",
    "\n",
    "print(answer)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 122,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There is a 0.707785 percent chance patient #1 does show up and a 0.292215 percent chance this person doesn't show up\n"
     ]
    }
   ],
   "source": [
    "probabilityresult = lg.predict_proba([answer])\n",
    "\n",
    "showupproba = probabilityresult[0]\n",
    "showupproba = showupproba[0]\n",
    "\n",
    "\n",
    "noshowproba = probabilityresult[0]\n",
    "noshowproba = noshowproba[1]\n",
    "\n",
    "print(\"There is a %f percent chance patient #1 does show up and a %f percent chance this person doesn't show up\" % (showupproba,noshowproba))\n",
    "      \n",
    "      "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analysis of different patients\n",
    "\n",
    "Age seems to be a big huge factor which makes sense because children aren't able to get to appointments themselves. Also if a child is 5 and is an alcholic i'm pretty sure that would mean they would have an alcholic and unrealiable parents which would make sense that they would have a way less likelyhood to show up than a 100 year old with the same features. \n",
    "\n",
    "While a 16% difference in showing up or not may not be the most instiutive or knowledgable thing in the world we can \"empirically\" say these people are more or less likely to show up than one another. \n",
    "\n",
    "side note I tried different training testing partitions(10/90 and 70/30) and that didn't really have any affect on my accuracy levels. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}