{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# No Shows By Chase Kregor\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook Bookmarks:\n", "\n", "### - Go to The Top of The Notebook\n", "### - Go to Clean and EDA\n", "### - Go to Feature Selection and Creation\n", "### - Go to Train Models" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "#import matplotlib.style as style\n", "#style.use('fivethirtyeight')\n", "import seaborn as sb" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Start of Clean and EDA " ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PatientId AppointmentID Gender ScheduledDay \\\n", "0 2.987250e+13 5642903 F 2016-04-29T18:38:08Z \n", "1 5.589978e+14 5642503 M 2016-04-29T16:08:27Z \n", "2 4.262962e+12 5642549 F 2016-04-29T16:19:04Z \n", "3 8.679512e+11 5642828 F 2016-04-29T17:29:31Z \n", "4 8.841186e+12 5642494 F 2016-04-29T16:07:23Z \n", "\n", " AppointmentDay Age Neighbourhood Scholarship Hipertension \\\n", "0 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0 1 \n", "1 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 0 \n", "2 2016-04-29T00:00:00Z 62 MATA DA PRAIA 0 0 \n", "3 2016-04-29T00:00:00Z 8 PONTAL DE CAMBURI 0 0 \n", "4 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 1 \n", "\n", " Diabetes Alcoholism Handcap SMS_received No-show \n", "0 0 0 0 0 No \n", "1 0 0 0 0 No \n", "2 0 0 0 0 No \n", "3 0 0 0 0 No \n", "4 1 0 0 0 No " ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.read_csv(\"../data/KaggleV2-May-2016.csv\")\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PatientID AppointmentID Gender ScheduledDay \\\n", "0 2.987250e+13 5642903 F 2016-04-29T18:38:08Z \n", "1 5.589978e+14 5642503 M 2016-04-29T16:08:27Z \n", "2 4.262962e+12 5642549 F 2016-04-29T16:19:04Z \n", "3 8.679512e+11 5642828 F 2016-04-29T17:29:31Z \n", "4 8.841186e+12 5642494 F 2016-04-29T16:07:23Z \n", "\n", " AppointmentDay Age Neighbourhood Scholarship Hypertension \\\n", "0 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0 1 \n", "1 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 0 \n", "2 2016-04-29T00:00:00Z 62 MATA DA PRAIA 0 0 \n", "3 2016-04-29T00:00:00Z 8 PONTAL DE CAMBURI 0 0 \n", "4 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 1 \n", "\n", " Diabetes Alcoholism Handicap SMS_received NoShow \n", "0 0 0 0 0 No \n", "1 0 0 0 0 No \n", "2 0 0 0 0 No \n", "3 0 0 0 0 No \n", "4 1 0 0 0 No " ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fix misspelled column names and normalise the rest.\n", "data.rename(columns = {'Hipertension':'Hypertension',\n", " 'PatientId': 'PatientID',\n", " 'Handcap': 'Handicap',\n", " 'No-show': 'NoShow'\n", " }, inplace = True)\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now trying to understand the dataset: its distributions and unique values. Also looking for odd or incorrect data points. 
I want to understand and check the integrity of the dataset." ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "8.221459e+14 88\n", "9.963767e+10 84\n", "2.688613e+13 70\n", "3.353478e+13 65\n", "2.584244e+11 62\n", "7.579746e+13 62\n", "8.713749e+14 62\n", "6.264199e+12 62\n", "6.684488e+13 57\n", "8.722785e+11 55\n", "8.923969e+13 54\n", "8.435224e+09 51\n", "8.534397e+14 50\n", "1.447997e+13 46\n", "6.543360e+13 46\n", "8.189452e+13 42\n", "9.452745e+12 42\n", "1.882323e+14 40\n", "9.496197e+12 38\n", "2.271580e+12 38\n", "1.336493e+13 37\n", "1.484143e+12 35\n", "8.883500e+13 34\n", "9.861628e+14 34\n", "7.124589e+14 33\n", "4.167557e+14 30\n", "6.128878e+12 30\n", "8.121397e+13 29\n", "8.634164e+12 24\n", "3.699499e+13 23\n", " ..\n", "6.375629e+12 1\n", "9.369127e+12 1\n", "5.375556e+14 1\n", "1.662184e+11 1\n", "7.234615e+13 1\n", "9.649990e+12 1\n", "6.912783e+10 1\n", "1.954265e+13 1\n", "2.736377e+10 1\n", "5.532694e+11 1\n", "7.149583e+12 1\n", "8.676752e+13 1\n", "7.838359e+13 1\n", "5.962625e+11 1\n", "4.919862e+13 1\n", "3.477350e+14 1\n", "1.626595e+13 1\n", "7.794917e+12 1\n", "1.161950e+13 1\n", "5.615364e+14 1\n", "4.355592e+11 1\n", "1.321328e+12 1\n", "1.751987e+13 1\n", "4.262579e+13 1\n", "3.115681e+13 1\n", "1.222828e+13 1\n", "6.821231e+11 1\n", "7.163981e+14 1\n", "9.798964e+14 1\n", "2.724571e+11 1\n", "Name: PatientID, Length: 62299, dtype: int64" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.PatientID.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Making sure there aren't duplicate appointment IDs: every count should be 1." ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5769215 1\n", "5731652 1\n", "5707080 1\n", "5702986 1\n", "5715276 1\n", "5717325 1\n", "5711182 1\n", "5758289 1\n", "5762391 1\n", "5741913 1\n", "5483871 1\n", "5660001 1\n", "5653858 1\n", 
"5666148 1\n", "5668197 1\n", "5641576 1\n", "5639531 1\n", "5649772 1\n", "5645678 1\n", "5647727 1\n", "5692785 1\n", "5686642 1\n", "5694838 1\n", "5696887 1\n", "5674360 1\n", "5733701 1\n", "5651786 1\n", "5672315 1\n", "5719362 1\n", "5672187 1\n", " ..\n", "5744033 1\n", "5748131 1\n", "5739943 1\n", "5672324 1\n", "5682563 1\n", "5680512 1\n", "5782866 1\n", "5496110 1\n", "5713200 1\n", "5711153 1\n", "5717298 1\n", "5709110 1\n", "5707063 1\n", "5729592 1\n", "5463358 1\n", "5565768 1\n", "5776721 1\n", "5789023 1\n", "5590396 1\n", "5606756 1\n", "5608807 1\n", "5635434 1\n", "5621101 1\n", "5686470 1\n", "5582192 1\n", "5586290 1\n", "5584243 1\n", "5598584 1\n", "5602682 1\n", "5771266 1\n", "Name: AppointmentID, Length: 110527, dtype: int64" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.AppointmentID.value_counts()" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['F', 'M'], dtype=object)" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.Gender.unique()" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "F 71840\n", "M 38687\n", "Name: Gender, dtype: int64" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.Gender.value_counts()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "trying to understand my timeline" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['2016-04-29T00:00:00Z', '2016-05-03T00:00:00Z',\n", " '2016-05-10T00:00:00Z', '2016-05-17T00:00:00Z',\n", " '2016-05-24T00:00:00Z', '2016-05-31T00:00:00Z',\n", " '2016-05-02T00:00:00Z', '2016-05-30T00:00:00Z',\n", " '2016-05-16T00:00:00Z', '2016-05-04T00:00:00Z',\n", " '2016-05-19T00:00:00Z', '2016-05-12T00:00:00Z',\n", " '2016-05-06T00:00:00Z', 
'2016-05-20T00:00:00Z',\n", " '2016-05-05T00:00:00Z', '2016-05-13T00:00:00Z',\n", " '2016-05-09T00:00:00Z', '2016-05-25T00:00:00Z',\n", " '2016-05-11T00:00:00Z', '2016-05-18T00:00:00Z',\n", " '2016-05-14T00:00:00Z', '2016-06-02T00:00:00Z',\n", " '2016-06-03T00:00:00Z', '2016-06-06T00:00:00Z',\n", " '2016-06-07T00:00:00Z', '2016-06-01T00:00:00Z',\n", " '2016-06-08T00:00:00Z'], dtype=object)" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.AppointmentDay.unique()" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " 0 3539\n", " 1 2273\n", " 52 1746\n", " 49 1652\n", " 53 1651\n", " 56 1635\n", " 38 1629\n", " 59 1624\n", " 2 1618\n", " 50 1613\n", " 57 1603\n", " 36 1580\n", " 51 1567\n", " 19 1545\n", " 39 1536\n", " 37 1533\n", " 54 1530\n", " 34 1526\n", " 33 1524\n", " 30 1521\n", " 6 1521\n", " 3 1513\n", " 17 1509\n", " 32 1505\n", " 5 1489\n", " 44 1487\n", " 18 1487\n", " 58 1469\n", " 46 1460\n", " 45 1453\n", " ... \n", " 74 602\n", " 76 571\n", " 75 544\n", " 78 541\n", " 77 527\n", " 80 511\n", " 81 434\n", " 82 392\n", " 79 390\n", " 84 311\n", " 83 280\n", " 85 275\n", " 86 260\n", " 87 184\n", " 89 173\n", " 88 126\n", " 90 109\n", " 92 86\n", " 91 66\n", " 93 53\n", " 94 33\n", " 95 24\n", " 96 17\n", " 97 11\n", " 98 6\n", " 115 5\n", " 100 4\n", " 102 2\n", " 99 1\n", "-1 1\n", "Name: Age, Length: 104, dtype: int64" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.Age.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "need to get rid of negative value. It is impossible for someone to be -1. 
\n" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [], "source": [ "# Overwrite the impossible negative age with 1 (the row is kept, not dropped).\n", "# .loc avoids the SettingWithCopyWarning raised by chained indexing.\n", "data.loc[data['Age'] < 0, 'Age'] = 1" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 3539\n", "1 2274\n", "52 1746\n", "49 1652\n", "53 1651\n", "56 1635\n", "38 1629\n", "59 1624\n", "2 1618\n", "50 1613\n", "57 1603\n", "36 1580\n", "51 1567\n", "19 1545\n", "39 1536\n", "37 1533\n", "54 1530\n", "34 1526\n", "33 1524\n", "30 1521\n", "6 1521\n", "3 1513\n", "17 1509\n", "32 1505\n", "5 1489\n", "44 1487\n", "18 1487\n", "58 1469\n", "46 1460\n", "45 1453\n", " ... 
\n", "72 615\n", "74 602\n", "76 571\n", "75 544\n", "78 541\n", "77 527\n", "80 511\n", "81 434\n", "82 392\n", "79 390\n", "84 311\n", "83 280\n", "85 275\n", "86 260\n", "87 184\n", "89 173\n", "88 126\n", "90 109\n", "92 86\n", "91 66\n", "93 53\n", "94 33\n", "95 24\n", "96 17\n", "97 11\n", "98 6\n", "115 5\n", "100 4\n", "102 2\n", "99 1\n", "Name: Age, Length: 103, dtype: int64" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.Age.value_counts()" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 99666\n", "1 10861\n", "Name: Scholarship, dtype: int64" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.Scholarship.value_counts()" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 88726\n", "1 21801\n", "Name: Hypertension, dtype: int64" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.Hypertension.value_counts()" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 102584\n", "1 7943\n", "Name: Diabetes, dtype: int64" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.Diabetes.value_counts()" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 108286\n", "1 2042\n", "2 183\n", "3 13\n", "4 3\n", "Name: Handicap, dtype: int64" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.Handicap.value_counts()" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 75045\n", "1 35482\n", "Name: SMS_received, dtype: int64" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"data.SMS_received.value_counts()" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "No 88208\n", "Yes 22319\n", "Name: NoShow, dtype: int64" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.NoShow.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Making sure there aren't any values missing in the dataset" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PatientID 0\n", "AppointmentID 0\n", "Gender 0\n", "ScheduledDay 0\n", "AppointmentDay 0\n", "Age 0\n", "Neighbourhood 0\n", "Scholarship 0\n", "Hypertension 0\n", "Diabetes 0\n", "Alcoholism 0\n", "Handicap 0\n", "SMS_received 0\n", "NoShow 0\n", "dtype: int64" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Age: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 102, 115]\n", "Gender: ['F' 'M']\n", "Diabetes: [0 1]\n", "Alcoholism: [0 1]\n", "Hypertension: [1 0]\n", "Handicap: [0 1 2 3 4]\n", "Scholarship: [0 1]\n", "SMS_received: [0 1]\n" ] } ], "source": [ "print('Age:',sorted(data.Age.unique()))\n", "print('Gender:',data.Gender.unique())\n", "#print('DayOfTheWeek:',data.DayOfTheWeek.unique())\n", "#print('Status:',data.Status.unique())\n", "print('Diabetes:',data.Diabetes.unique())\n", "print('Alcoholism:',data.Alcoholism.unique())\n", 
"print('Hypertension:',data.Hypertension.unique())\n", "print('Handicap:',data.Handicap.unique())\n", "#print('Smokes:',data.Smokes.unique())\n", "print('Scholarship:',data.Scholarship.unique())\n", "#print('Tuberculosis:',data.Tuberculosis.unique())\n", "print('SMS_received:',data.SMS_received.unique())\n", "#print('AwaitingTime:',sorted(data.AwaitingTime.unique()))\n", "#print('HourOfTheDay:', sorted(data.HourOfTheDay.unique()))" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PatientID AppointmentID Gender ScheduledDay \\\n", "0 2.987250e+13 5642903 F 2016-04-29T18:38:08Z \n", "1 5.589978e+14 5642503 M 2016-04-29T16:08:27Z \n", "2 4.262962e+12 5642549 F 2016-04-29T16:19:04Z \n", "3 8.679512e+11 5642828 F 2016-04-29T17:29:31Z \n", "4 8.841186e+12 5642494 F 2016-04-29T16:07:23Z \n", "\n", " AppointmentDay Age Neighbourhood Scholarship Hypertension \\\n", "0 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0 1 \n", "1 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 0 \n", "2 2016-04-29T00:00:00Z 62 MATA DA PRAIA 0 0 \n", "3 2016-04-29T00:00:00Z 8 PONTAL DE CAMBURI 0 0 \n", "4 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 1 \n", "\n", " Diabetes Alcoholism Handicap SMS_received NoShow \n", "0 0 0 0 0 No \n", "1 0 0 0 0 No \n", "2 0 0 0 0 No \n", "3 0 0 0 0 No \n", "4 1 0 0 0 No " ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature Selection and Creation
" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PatientID AppointmentID Gender ScheduledDay \\\n", "0 2.987250e+13 5642903 F 2016-04-29T18:38:08Z \n", "1 5.589978e+14 5642503 M 2016-04-29T16:08:27Z \n", "2 4.262962e+12 5642549 F 2016-04-29T16:19:04Z \n", "3 8.679512e+11 5642828 F 2016-04-29T17:29:31Z \n", "4 8.841186e+12 5642494 F 2016-04-29T16:07:23Z \n", "\n", " AppointmentDay Age Neighbourhood Scholarship Hypertension \\\n", "0 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0 1 \n", "1 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 0 \n", "2 2016-04-29T00:00:00Z 62 MATA DA PRAIA 0 0 \n", "3 2016-04-29T00:00:00Z 8 PONTAL DE CAMBURI 0 0 \n", "4 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 1 \n", "\n", " Diabetes Alcoholism Handicap SMS_received NoShow \n", "0 0 0 0 0 No \n", "1 0 0 0 0 No \n", "2 0 0 0 0 No \n", "3 0 0 0 0 No \n", "4 1 0 0 0 No " ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The original drop() call was a no-op: drop() returns a new DataFrame and the result was never assigned.\n", "# It is commented out here because PatientID is still selected in a later cell.\n", "# data.drop(data.columns[[0]], axis=1)\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating two binary columns for gender" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PatientID AppointmentID Gender ScheduledDay \\\n", "0 2.987250e+13 5642903 F 2016-04-29T18:38:08Z \n", "1 5.589978e+14 5642503 M 2016-04-29T16:08:27Z \n", "2 4.262962e+12 5642549 F 2016-04-29T16:19:04Z \n", "3 8.679512e+11 5642828 F 2016-04-29T17:29:31Z \n", "4 8.841186e+12 5642494 F 2016-04-29T16:07:23Z \n", "\n", " AppointmentDay Age Neighbourhood Scholarship Hypertension \\\n", "0 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0 1 \n", "1 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 0 \n", "2 2016-04-29T00:00:00Z 62 MATA DA PRAIA 0 0 \n", "3 2016-04-29T00:00:00Z 8 PONTAL DE CAMBURI 0 0 \n", "4 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 1 \n", "\n", " Diabetes Alcoholism Handicap SMS_received NoShow Male Female \n", "0 0 0 0 0 No 0 1 \n", "1 0 0 0 0 No 1 0 \n", "2 0 0 0 0 No 0 1 \n", "3 0 0 0 0 No 0 1 \n", "4 1 0 0 0 No 0 1 " ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['Male'] = data['Gender'].replace(['F','M'], [0,1])\n", "data['Female'] = data['Gender'].replace(['F','M'], [1,0])\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "changing NoShow column to be binary" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PatientID AppointmentID Gender ScheduledDay \\\n", "0 2.987250e+13 5642903 F 2016-04-29T18:38:08Z \n", "1 5.589978e+14 5642503 M 2016-04-29T16:08:27Z \n", "2 4.262962e+12 5642549 F 2016-04-29T16:19:04Z \n", "3 8.679512e+11 5642828 F 2016-04-29T17:29:31Z \n", "4 8.841186e+12 5642494 F 2016-04-29T16:07:23Z \n", "\n", " AppointmentDay Age Neighbourhood Scholarship Hypertension \\\n", "0 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0 1 \n", "1 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 0 \n", "2 2016-04-29T00:00:00Z 62 MATA DA PRAIA 0 0 \n", "3 2016-04-29T00:00:00Z 8 PONTAL DE CAMBURI 0 0 \n", "4 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 1 \n", "\n", " Diabetes Alcoholism Handicap SMS_received NoShow Male Female \n", "0 0 0 0 0 0 0 1 \n", "1 0 0 0 0 0 1 0 \n", "2 0 0 0 0 0 0 1 \n", "3 0 0 0 0 0 0 1 \n", "4 1 0 0 0 0 0 1 " ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Encode the target: 1 = patient did not show up, 0 = patient attended.\n", "data['NoShow'] = data['NoShow'].replace(['Yes','No'], [1,0])\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dropping unneeded columns (AppointmentID and Gender)" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PatientID ScheduledDay AppointmentDay Age \\\n", "0 2.987250e+13 2016-04-29T18:38:08Z 2016-04-29T00:00:00Z 62 \n", "1 5.589978e+14 2016-04-29T16:08:27Z 2016-04-29T00:00:00Z 56 \n", "2 4.262962e+12 2016-04-29T16:19:04Z 2016-04-29T00:00:00Z 62 \n", "3 8.679512e+11 2016-04-29T17:29:31Z 2016-04-29T00:00:00Z 8 \n", "4 8.841186e+12 2016-04-29T16:07:23Z 2016-04-29T00:00:00Z 56 \n", "\n", " Neighbourhood Scholarship Hypertension Diabetes Alcoholism \\\n", "0 JARDIM DA PENHA 0 1 0 0 \n", "1 JARDIM DA PENHA 0 0 0 0 \n", "2 MATA DA PRAIA 0 0 0 0 \n", "3 PONTAL DE CAMBURI 0 0 0 0 \n", "4 JARDIM DA PENHA 0 1 1 0 \n", "\n", " Handicap SMS_received NoShow Male Female \n", "0 0 0 0 0 1 \n", "1 0 0 0 1 0 \n", "2 0 0 0 0 1 \n", "3 0 0 0 0 1 \n", "4 0 0 0 0 1 " ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = data[['PatientID','ScheduledDay','AppointmentDay', 'Age','Neighbourhood', 'Scholarship','Hypertension','Diabetes','Alcoholism','Handicap','SMS_received','NoShow','Male','Female']]\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "cleaning up dates" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PatientID ScheduledDay AppointmentDay Age Neighbourhood \\\n", "0 2.987250e+13 2016-04-29 18:38:08 2016-04-29 62 JARDIM DA PENHA \n", "1 5.589978e+14 2016-04-29 16:08:27 2016-04-29 56 JARDIM DA PENHA \n", "2 4.262962e+12 2016-04-29 16:19:04 2016-04-29 62 MATA DA PRAIA \n", "3 8.679512e+11 2016-04-29 17:29:31 2016-04-29 8 PONTAL DE CAMBURI \n", "4 8.841186e+12 2016-04-29 16:07:23 2016-04-29 56 JARDIM DA PENHA \n", "\n", " Scholarship Hypertension Diabetes Alcoholism Handicap SMS_received \\\n", "0 0 1 0 0 0 0 \n", "1 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 \n", "4 0 1 1 0 0 0 \n", "\n", " NoShow Male Female \n", "0 0 0 1 \n", "1 0 1 0 \n", "2 0 0 1 \n", "3 0 0 1 \n", "4 0 0 1 " ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['ScheduledDay'] = pd.to_datetime(data['ScheduledDay'])\n", "data['AppointmentDay'] = pd.to_datetime(data['AppointmentDay'])\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"\\ndata['ScheduledYear'], data['ScheduledMonth'], data['ScheduleDay'] = data['ScheduledDay'].dt.year, data['ScheduledDay'].dt.month, data['ScheduledDay'].dt.day\\ndata['AppointmentYear'], data['AppointmentMonth'], data['AppointmentDayy'] = data['AppointmentDay'].dt.year, data['AppointmentDay'].dt.month, data['AppointmentDay'].dt.day\\ndata.head\\n\"" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#original features that I was using, got rid of. 
\n", "\"\"\"\n", "data['ScheduledYear'], data['ScheduledMonth'], data['ScheduleDay'] = data['ScheduledDay'].dt.year, data['ScheduledDay'].dt.month, data['ScheduledDay'].dt.day\n", "data['AppointmentYear'], data['AppointmentMonth'], data['AppointmentDayy'] = data['AppointmentDay'].dt.year, data['AppointmentDay'].dt.month, data['AppointmentDay'].dt.day\n", "data.head\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will probably have to drop the Neighbourhood column, since we don't know the specific hospital for each of these appointments. If we did, we could have used distance from the hospital as a feature. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, I create a feature that measures a patient's waiting time: the number of days between when they schedule the appointment and the appointment itself. I expect this to be a useful feature, on the assumption that the longer the wait, the more likely a patient is to miss the appointment." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating Waiting Time" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PatientIDScheduledDayAppointmentDayAgeNeighbourhoodScholarshipHypertensionDiabetesAlcoholismHandicapSMS_receivedNoShowMaleFemaleWaitingTime
02.987250e+132016-04-292016-04-2962JARDIM DA PENHA0100000010 days
15.589978e+142016-04-292016-04-2956JARDIM DA PENHA0000000100 days
24.262962e+122016-04-292016-04-2962MATA DA PRAIA0000000010 days
38.679512e+112016-04-292016-04-298PONTAL DE CAMBURI0000000010 days
48.841186e+122016-04-292016-04-2956JARDIM DA PENHA0110000010 days
\n", "
" ], "text/plain": [ " PatientID ScheduledDay AppointmentDay Age Neighbourhood \\\n", "0 2.987250e+13 2016-04-29 2016-04-29 62 JARDIM DA PENHA \n", "1 5.589978e+14 2016-04-29 2016-04-29 56 JARDIM DA PENHA \n", "2 4.262962e+12 2016-04-29 2016-04-29 62 MATA DA PRAIA \n", "3 8.679512e+11 2016-04-29 2016-04-29 8 PONTAL DE CAMBURI \n", "4 8.841186e+12 2016-04-29 2016-04-29 56 JARDIM DA PENHA \n", "\n", " Scholarship Hypertension Diabetes Alcoholism Handicap SMS_received \\\n", "0 0 1 0 0 0 0 \n", "1 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 \n", "4 0 1 1 0 0 0 \n", "\n", " NoShow Male Female WaitingTime \n", "0 0 0 1 0 days \n", "1 0 1 0 0 days \n", "2 0 0 1 0 days \n", "3 0 0 1 0 days \n", "4 0 0 1 0 days " ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.ScheduledDay = pd.DatetimeIndex(data.ScheduledDay).normalize()\n", "data['WaitingTime'] = data['AppointmentDay'] - data['ScheduledDay']\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PatientIDScheduledDayAppointmentDayAgeNeighbourhoodScholarshipHypertensionDiabetesAlcoholismHandicapSMS_receivedNoShowMaleFemaleWaitingTime
02.987250e+132016-04-292016-04-2962JARDIM DA PENHA0100000010
15.589978e+142016-04-292016-04-2956JARDIM DA PENHA0000000100
24.262962e+122016-04-292016-04-2962MATA DA PRAIA0000000010
38.679512e+112016-04-292016-04-298PONTAL DE CAMBURI0000000010
48.841186e+122016-04-292016-04-2956JARDIM DA PENHA0110000010
59.598513e+132016-04-272016-04-2976REPÚBLICA0100000012
67.336882e+142016-04-272016-04-2923GOIABEIRAS0000001012
73.449833e+122016-04-272016-04-2939GOIABEIRAS0000001012
85.639473e+132016-04-292016-04-2921ANDORINHAS0000000010
97.812456e+132016-04-272016-04-2919CONQUISTA0000000012
107.345362e+142016-04-272016-04-2930NOVA PALESTINA0000000012
117.542951e+122016-04-262016-04-2929NOVA PALESTINA0000011103
125.666548e+142016-04-282016-04-2922NOVA PALESTINA1000000011
139.113946e+142016-04-282016-04-2928NOVA PALESTINA0000000101
149.988472e+132016-04-282016-04-2954NOVA PALESTINA0000000011
159.994839e+102016-04-262016-04-2915NOVA PALESTINA0000010013
168.457439e+132016-04-282016-04-2950NOVA PALESTINA0000000101
171.479497e+132016-04-282016-04-2940CONQUISTA1000001011
181.713538e+132016-04-262016-04-2930NOVA PALESTINA1000010013
197.223289e+122016-04-292016-04-2946DA PENHA0000000010
\n", "
" ], "text/plain": [ " PatientID ScheduledDay AppointmentDay Age Neighbourhood \\\n", "0 2.987250e+13 2016-04-29 2016-04-29 62 JARDIM DA PENHA \n", "1 5.589978e+14 2016-04-29 2016-04-29 56 JARDIM DA PENHA \n", "2 4.262962e+12 2016-04-29 2016-04-29 62 MATA DA PRAIA \n", "3 8.679512e+11 2016-04-29 2016-04-29 8 PONTAL DE CAMBURI \n", "4 8.841186e+12 2016-04-29 2016-04-29 56 JARDIM DA PENHA \n", "5 9.598513e+13 2016-04-27 2016-04-29 76 REPÚBLICA \n", "6 7.336882e+14 2016-04-27 2016-04-29 23 GOIABEIRAS \n", "7 3.449833e+12 2016-04-27 2016-04-29 39 GOIABEIRAS \n", "8 5.639473e+13 2016-04-29 2016-04-29 21 ANDORINHAS \n", "9 7.812456e+13 2016-04-27 2016-04-29 19 CONQUISTA \n", "10 7.345362e+14 2016-04-27 2016-04-29 30 NOVA PALESTINA \n", "11 7.542951e+12 2016-04-26 2016-04-29 29 NOVA PALESTINA \n", "12 5.666548e+14 2016-04-28 2016-04-29 22 NOVA PALESTINA \n", "13 9.113946e+14 2016-04-28 2016-04-29 28 NOVA PALESTINA \n", "14 9.988472e+13 2016-04-28 2016-04-29 54 NOVA PALESTINA \n", "15 9.994839e+10 2016-04-26 2016-04-29 15 NOVA PALESTINA \n", "16 8.457439e+13 2016-04-28 2016-04-29 50 NOVA PALESTINA \n", "17 1.479497e+13 2016-04-28 2016-04-29 40 CONQUISTA \n", "18 1.713538e+13 2016-04-26 2016-04-29 30 NOVA PALESTINA \n", "19 7.223289e+12 2016-04-29 2016-04-29 46 DA PENHA \n", "\n", " Scholarship Hypertension Diabetes Alcoholism Handicap SMS_received \\\n", "0 0 1 0 0 0 0 \n", "1 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 \n", "4 0 1 1 0 0 0 \n", "5 0 1 0 0 0 0 \n", "6 0 0 0 0 0 0 \n", "7 0 0 0 0 0 0 \n", "8 0 0 0 0 0 0 \n", "9 0 0 0 0 0 0 \n", "10 0 0 0 0 0 0 \n", "11 0 0 0 0 0 1 \n", "12 1 0 0 0 0 0 \n", "13 0 0 0 0 0 0 \n", "14 0 0 0 0 0 0 \n", "15 0 0 0 0 0 1 \n", "16 0 0 0 0 0 0 \n", "17 1 0 0 0 0 0 \n", "18 1 0 0 0 0 1 \n", "19 0 0 0 0 0 0 \n", "\n", " NoShow Male Female WaitingTime \n", "0 0 0 1 0 \n", "1 0 1 0 0 \n", "2 0 0 1 0 \n", "3 0 0 1 0 \n", "4 0 0 1 0 \n", "5 0 0 1 2 \n", "6 1 0 1 2 \n", "7 1 0 1 2 \n", "8 0 0 1 0 \n", "9 0 0 1 2 \n", "10 0 0 1 2 \n", 
"11 1 1 0 3 \n", "12 0 0 1 1 \n", "13 0 1 0 1 \n", "14 0 0 1 1 \n", "15 0 0 1 3 \n", "16 0 1 0 1 \n", "17 1 0 1 1 \n", "18 0 0 1 3 \n", "19 0 0 1 0 " ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['WaitingTime'] = data['WaitingTime'].apply(lambda x: x.days)\n", "data.head(20)" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "10.183701719941734" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['WaitingTime'].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Need to make one last clean dataset picking which features I will actually use in the model. " ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NoShowScheduledDayAppointmentDayAgeScholarshipHypertensionDiabetesAlcoholismHandicapSMS_receivedMaleFemaleWaitingTime
002016-04-292016-04-2962010000010
102016-04-292016-04-2956000000100
202016-04-292016-04-2962000000010
302016-04-292016-04-298000000010
402016-04-292016-04-2956011000010
\n", "
" ], "text/plain": [ " NoShow ScheduledDay AppointmentDay Age Scholarship Hypertension \\\n", "0 0 2016-04-29 2016-04-29 62 0 1 \n", "1 0 2016-04-29 2016-04-29 56 0 0 \n", "2 0 2016-04-29 2016-04-29 62 0 0 \n", "3 0 2016-04-29 2016-04-29 8 0 0 \n", "4 0 2016-04-29 2016-04-29 56 0 1 \n", "\n", " Diabetes Alcoholism Handicap SMS_received Male Female WaitingTime \n", "0 0 0 0 0 0 1 0 \n", "1 0 0 0 0 1 0 0 \n", "2 0 0 0 0 0 1 0 \n", "3 0 0 0 0 0 1 0 \n", "4 1 0 0 0 0 1 0 " ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = data[['NoShow','ScheduledDay','AppointmentDay','Age','Scholarship','Hypertension','Diabetes','Alcoholism','Handicap','SMS_received','Male','Female','WaitingTime']]\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train Models
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is probably the most interesting and complex part of this analysis. \n", "\n", "I will train several different models and compare their training and testing accuracies, pick the best one, and then try to optimize that model." ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NoShowScheduledDayAppointmentDayAgeScholarshipHypertensionDiabetesAlcoholismHandicapSMS_receivedMaleFemaleWaitingTime
002016-04-292016-04-2962010000010
102016-04-292016-04-2956000000100
202016-04-292016-04-2962000000010
302016-04-292016-04-298000000010
402016-04-292016-04-2956011000010
\n", "
" ], "text/plain": [ " NoShow ScheduledDay AppointmentDay Age Scholarship Hypertension \\\n", "0 0 2016-04-29 2016-04-29 62 0 1 \n", "1 0 2016-04-29 2016-04-29 56 0 0 \n", "2 0 2016-04-29 2016-04-29 62 0 0 \n", "3 0 2016-04-29 2016-04-29 8 0 0 \n", "4 0 2016-04-29 2016-04-29 56 0 1 \n", "\n", " Diabetes Alcoholism Handicap SMS_received Male Female WaitingTime \n", "0 0 0 0 0 0 1 0 \n", "1 0 0 0 0 1 0 0 \n", "2 0 0 0 0 0 1 0 \n", "3 0 0 0 0 0 1 0 \n", "4 1 0 0 0 0 1 0 " ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head()" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AgeScholarshipHypertensionDiabetesAlcoholismHandicapSMS_receivedMaleFemaleWaitingTime
062010000010
156000000100
262000000010
38000000010
456011000010
\n", "
" ], "text/plain": [ " Age Scholarship Hypertension Diabetes Alcoholism Handicap \\\n", "0 62 0 1 0 0 0 \n", "1 56 0 0 0 0 0 \n", "2 62 0 0 0 0 0 \n", "3 8 0 0 0 0 0 \n", "4 56 0 1 1 0 0 \n", "\n", " SMS_received Male Female WaitingTime \n", "0 0 0 1 0 \n", "1 0 1 0 0 \n", "2 0 0 1 0 \n", "3 0 0 1 0 \n", "4 0 0 1 0 " ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#the features we are going to train our model on\n", "showfeatures = data.iloc[:,3:19]\n", "showfeatures.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Splitting" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split, cross_val_score\n", "from sklearn.metrics import accuracy_score\n" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[62, 0, 1, ..., 0, 1, 0],\n", " [56, 0, 0, ..., 1, 0, 0],\n", " [62, 0, 0, ..., 0, 1, 0],\n", " ..., \n", " [21, 0, 0, ..., 0, 1, 41],\n", " [38, 0, 0, ..., 0, 1, 41],\n", " [54, 0, 0, ..., 0, 1, 41]])" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# TRAIN/TEST PARTITION\n", "#create the feature vectors and class labels\n", "\n", "\n", "features = np.array(showfeatures)\n", "labels = np.array(data['NoShow'])\n", "\n", "features" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 88208\n", "1 22319\n", "Name: NoShow, dtype: int64" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['NoShow'].value_counts()" ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#split the data into training and testing sets (80% training, 20% testing)\n", "training_features, testing_features, training_labels, testing_labels = 
train_test_split(features,labels, test_size = 0.2,random_state = 42 )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Logistic regression" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", " intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n", " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n", " verbose=0, warm_start=False)" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn import linear_model\n", "\n", "lg = linear_model.LogisticRegression()\n", "lg.fit(training_features,training_labels)" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 0.7955445 0.79463983 0.79359873 0.79246777 0.79552138 0.79518209\n", " 0.79552138 0.79371183 0.79416422 0.79199186]\n", "[0 0 0 ..., 0 0 0]\n", "0.795304442233\n" ] } ], "source": [ "# hold out cross validation\n", "\n", "#print(accuracy_score(testing_labels, predictions))\n", "\n", "score = cross_val_score(lg, training_features,training_labels, cv = 10, scoring= 'accuracy')\n", "print(score)\n", "\n", "predictions = lg.predict(testing_features)\n", "print(predictions)\n", "\n", "score = accuracy_score(testing_labels, predictions)\n", "print(score)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multi Layer Perceptron just for fun, takes forever to run" ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'from sklearn.neural_network import MLPClassifier\\nmlp = MLPClassifier(alpha=0.01, hidden_layer_sizes = (100,100))\\nmlp.fit(training_features,training_labels)'" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"\"\"from sklearn.neural_network import MLPClassifier\n", "mlp = 
MLPClassifier(alpha=0.01, hidden_layer_sizes = (100,100))\n", "mlp.fit(training_features,training_labels)\"\"\"" ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"# hold out cross validation\\n\\n#print(accuracy_score(testing_labels, predictions))\\n\\nscore = cross_val_score(mlp, training_features,training_labels, cv = 10, scoring= 'accuracy')\\nprint(score)\\n\\npredictions = mlp.predict(testing_features)\\nprint(predictions)\\n\\nscore = accuracy_score(testing_labels, predictions)\\nprint(score)\"" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"\"\"# hold out cross validation\n", "\n", "#print(accuracy_score(testing_labels, predictions))\n", "\n", "score = cross_val_score(mlp, training_features,training_labels, cv = 10, scoring= 'accuracy')\n", "print(score)\n", "\n", "predictions = mlp.predict(testing_features)\n", "print(predictions)\n", "\n", "score = accuracy_score(testing_labels, predictions)\n", "print(score)\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Random Forest" ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n", " max_depth=2, max_features='auto', max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,\n", " oob_score=False, random_state=None, verbose=0,\n", " warm_start=False)" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "\n", "rf = RandomForestClassifier(max_depth=2,n_estimators=100)\n", "rf.fit(training_features,training_labels)" ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "[ 0.79769309 0.79769309 0.79778331 0.79778331 0.79778331 0.79778331\n", " 0.79778331 0.79778331 0.79778331 0.79776043]\n", "[0 0 0 ..., 0 0 0]\n", "0.79928526192\n" ] } ], "source": [ "# hold out cross validation\n", "\n", "#print(accuracy_score(testing_labels, predictions))\n", "\n", "score = cross_val_score(rf, training_features,training_labels, cv = 10, scoring= 'accuracy')\n", "print(score)\n", "\n", "predictions = rf.predict(testing_features)\n", "print(predictions)\n", "\n", "score = accuracy_score(testing_labels, predictions)\n", "print(score)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " ## KNeighbors" ] }, { "cell_type": "code", "execution_count": 106, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier" ] }, { "cell_type": "code", "execution_count": 107, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#TRAIN/TEST ALGORITHM\n", "#instance the model\n", "\n", "kList = range(1,50)\n", "\n", "cv_scores = []\n", "\n", "neighbors = filter(lambda x: x % 2 != 0, kList)" ] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\n", "3\n", "5\n", "7\n", "9\n", "11\n", "13\n", "15\n", "17\n", "19\n", "21\n", "23\n", "25\n", "27\n", "29\n", "31\n", "33\n", "35\n", "37\n", "39\n", "41\n", "43\n", "45\n", "47\n", "49\n", "done finding\n" ] } ], "source": [ "\n", "for i in neighbors:\n", " print(i)\n", " knn = KNeighborsClassifier(n_neighbors=i)\n", "\n", " knn.fit(training_features,training_labels)\n", "\n", " #test the model\n", " predictions = knn.predict(testing_features)\n", "\n", " # hold out cross validation\n", "\n", " #print(accuracy_score(testing_labels, predictions))\n", "\n", "\n", " scores = cross_val_score(knn, training_features,training_labels, cv = 10, scoring= 'accuracy')\n", " #print(scores)\n", " cv_scores.append(scores.mean())\n", "\n", 
"print(\"done finding\") " ] }, { "cell_type": "code", "execution_count": 109, "metadata": { "collapsed": true }, "outputs": [], "source": [ "optimalk = cv_scores.index(max(cv_scores))" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n", " metric_params=None, n_jobs=1, n_neighbors=24, p=2,\n", " weights='uniform')" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "knn = KNeighborsClassifier(n_neighbors=optimalk)\n", "knn.fit(training_features,training_labels)" ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.795711571519\n" ] } ], "source": [ "\n", "predictions = knn.predict(testing_features)\n", "#data['predictions'] = knn.predict(testing_features)\n", "\n", "\n", "score = accuracy_score(testing_labels, predictions)\n", "print(score)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Predict Whether a Patient Will Show Up: User Input" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NoShowScheduledDayAppointmentDayAgeScholarshipHypertensionDiabetesAlcoholismHandicapSMS_receivedMaleFemaleWaitingTime
002016-04-292016-04-2962010000010
102016-04-292016-04-2956000000100
202016-04-292016-04-2962000000010
302016-04-292016-04-298000000010
402016-04-292016-04-2956011000010
\n", "
" ], "text/plain": [ " NoShow ScheduledDay AppointmentDay Age Scholarship Hypertension \\\n", "0 0 2016-04-29 2016-04-29 62 0 1 \n", "1 0 2016-04-29 2016-04-29 56 0 0 \n", "2 0 2016-04-29 2016-04-29 62 0 0 \n", "3 0 2016-04-29 2016-04-29 8 0 0 \n", "4 0 2016-04-29 2016-04-29 56 0 1 \n", "\n", " Diabetes Alcoholism Handicap SMS_received Male Female WaitingTime \n", "0 0 0 0 0 0 1 0 \n", "1 0 0 0 0 1 0 0 \n", "2 0 0 0 0 0 1 0 \n", "3 0 0 0 0 0 1 0 \n", "4 1 0 0 0 0 1 0 " ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head()" ] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Enter the patient's age: \n", "5\n", "Is the patient on Scholarship(Yes or No): \n", "Yes\n", "Does the patient have Hypertension?(Yes or No): \n", "Yes\n", "Is the patient Diabetic?(Yes or No): \n", "Yes\n", "Is the patient an Alcoholic?(Yes or No): \n", "Yes\n", "How many Handicaps does the patient have?: \n", "5\n", "Will the patient recieve a text message?(Yes or No): \n", "Yes\n", "What is the patient gender?(Male or Female): \n", "Female\n", "How many days away is the appointment?: \n", "10\n", " \n", "user answer: \n", "[5, 1.0, 1.0, 1.0, 1.0, 5, 1.0, 1.0, 0.0, 10]\n", "[0]\n", "[0]\n", "[0]\n" ] } ], "source": [ "age = 0.0\n", "scholarship = 0.0\n", "hypertension = 0.0\n", "diabetes = 0.0\n", "alcoholism = 0.0\n", "handicap = 0.0\n", "sms = 0.0\n", "male = 0.0\n", "female = 0.0\n", "waitingtime = 0\n", "\n", "\n", "\n", "inputage = int(input(\"Enter the patient's age: \"+ \"\\n\" ))\n", "inputscholarship = input(\"Is the patient on Scholarship(Yes or No): \"+ \"\\n\" )\n", "inputhypertension = input(\"Does the patient have Hypertension?(Yes or No): \"+ \"\\n\" )\n", "inputdiabetes = input(\"Is the patient Diabetic?(Yes or No): \"+ \"\\n\" )\n", "inputalcoholism = input(\"Is the patient an Alcoholic?(Yes or No): \"+ \"\\n\" )\n", "inputhandicap = 
int(input(\"How many Handicaps does the patient have?: \"+ \"\\n\" ))\n", "inputsms = input(\"Will the patient recieve a text message?(Yes or No): \"+ \"\\n\" )\n", "inputgender = input(\"What is the patient gender?(Male or Female): \"+ \"\\n\")\n", "\n", "inputwaitingtime = int(input(\"How many days away is the appointment?: \"+ \"\\n\"))\n", "\n", "\n", "age = inputage\n", "handicap = inputhandicap\n", "waitingtime = inputwaitingtime\n", "\n", "\n", "#compare each answer against the accepted spellings; `x == \"Yes\" or \"yes\"` is always true\n", "if inputscholarship in (\"Yes\", \"yes\"):\n", "    scholarship = 1.0\n", "elif inputscholarship in (\"No\", \"no\"):\n", "    scholarship = 0\n", "\n", "if inputhypertension in (\"Yes\", \"yes\"):\n", "    hypertension = 1.0\n", "elif inputhypertension in (\"No\", \"no\"):\n", "    hypertension = 0\n", "if inputdiabetes in (\"Yes\", \"yes\"):\n", "    diabetes = 1.0\n", "elif inputdiabetes in (\"No\", \"no\"):\n", "    diabetes = 0\n", "if inputalcoholism in (\"Yes\", \"yes\"):\n", "    alcoholism = 1.0\n", "elif inputalcoholism in (\"No\", \"no\"):\n", "    alcoholism = 0\n", "if inputsms in (\"Yes\", \"yes\"):\n", "    sms = 1.0\n", "elif inputsms in (\"No\", \"no\"):\n", "    sms = 0\n", "if inputgender in (\"Male\", \"M\"):\n", "    male = 1.0\n", "elif inputgender in (\"Female\", \"F\"):\n", "    female = 1.0\n", "else:\n", "    print(\"incorrect gender input\")\n", "\n", "\n", "answer = [age,scholarship,hypertension,diabetes,alcoholism,handicap,sms,male,female,waitingtime]\n", "\n", "#the commented out chunk was for testing to not have to type everything in everytime I wanted to test\n", "\"\"\"\n", "print(\" \")\n", "\n", "print(\"Fixed answer: \")\n", "\n", "fixedanswer = [5, 1.0, 1.0 ,1.0, 1.0, 4, 0.0, 1.0, 1.0,10]\n", "\n", "print(fixedanswer)\n", "\n", "print(rf.predict([fixedanswer]))\n", "print(lg.predict([fixedanswer]))\n", "print(knn.predict([fixedanswer]))\n", "\"\"\"\n", "\n", "print(\" \")\n", "\n", "print(\"user answer: \")\n", "\n", "\n", "print(answer)\n", "\n", "print(rf.predict([answer]))\n", 
"print(lg.predict([answer]))\n", "print(knn.predict([answer]))\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Ensemble Model" ] }, { "cell_type": "code", "execution_count": 114, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Here we will use an ensemble of our three different models to vote whether this patient will show up or not.\n", "Let's see how our individual models voted.\n", "\n", "random forrest:\n", "0\n", "show\n", "\n", "logistic regression:\n", "0\n", "show\n", "\n", "knn:\n", "0\n", "show\n", "\n", "The final vote is 0.000000 no shows to 3 shows\n" ] } ], "source": [ "print(\"Here we will use an ensemble of our three different models to vote whether this patient will show up or not.\")\n", "print(\"Let's see how our individual models voted.\")\n", "noshowcounter = 0.0\n", "showcounter = 0.0\n", "\n", "randomresult = (rf.predict([answer]))\n", "logresult = (lg.predict([answer]))\n", "knnresult = (knn.predict([answer]))\n", "\n", "print(\"\")\n", "print(\"random forrest:\")\n", "print(randomresult[0])\n", "\n", "if randomresult[0] == 1:\n", "    noshowcounter+=1\n", "    print(\"no show\")\n", "else:\n", "    showcounter+=1\n", "    print(\"show\")\n", "\n", "print(\"\")\n", "print(\"logistic regression:\")\n", "print(logresult[0])\n", "\n", "if logresult[0] == 1:\n", "    noshowcounter+=1\n", "    print(\"no show\")\n", "else:\n", "    showcounter+=1\n", "    print(\"show\")\n", "\n", "\n", "print(\"\")\n", "print(\"knn:\")\n", "print(knnresult[0])\n", "\n", "if knnresult[0] == 1:\n", "    noshowcounter+=1\n", "    print(\"no show\")\n", "else:\n", "    showcounter+=1\n", "    print(\"show\")\n", "\n", "\n", "\n", "print(\"\") \n", "print(\"The final vote is %f no shows to %d shows\" %(noshowcounter,showcounter))\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Analysis of Ensemble Model:\n", "What this shows is that the models need to disagree more, or we need more features: an ensemble 
doesn't really help here because all three models vote the same way" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Result" ] }, { "cell_type": "code", "execution_count": 115, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Our Model Predicts that this patient will show up to their appointment, intervention is not needed\n" ] } ], "source": [ "\n", "if noshowcounter >= 2:\n", "    print(\"Our Ensemble Model predicts the patient is not going to show up for their appointment.\")\n", "    print(\"Be advised, intervention may be necessary or suggested in order for the patient to show up \")\n", "\n", "else:\n", "    print(\"Our Model Predicts that this patient will show up to their appointment, intervention is not needed\")\n", "\n" ] }, { "cell_type": "code", "execution_count": 116, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0 0 0 ..., 0 0 0]\n", "[0 0 0 ..., 0 0 0]\n", "[0 0 0 ..., 0 0 0]\n", "\n", "\n", "done with for loop\n" ] } ], "source": [ "rfpredictions = rf.predict(testing_features)\n", "lgpredictions = lg.predict(testing_features)\n", "knnpredictions = knn.predict(testing_features)\n", "\n", "print(rfpredictions)\n", "print(lgpredictions)\n", "print(knnpredictions)\n", "print(\"\")\n", "\n", "lengthoflists = (len(rfpredictions))\n", "\n", "ensemblepredictions = []\n", "\n", "ensemblescore = 0\n", "\n", "for i in range(len(rfpredictions)):\n", "    #reset the vote count so the previous sample's result does not carry over\n", "    ensemblescore = 0\n", "\n", "    if rfpredictions[i] == 1:\n", "        ensemblescore+=1\n", "    if lgpredictions[i] == 1:\n", "        ensemblescore+=1\n", "    if knnpredictions[i] == 1:\n", "        ensemblescore+=1\n", "\n", "    #majority vote: predict a no show when at least 2 of the 3 models do\n", "    if ensemblescore >= 2:\n", "        ensemblescore = 1\n", "    else:\n", "        ensemblescore = 0\n", "\n", "    ensemblepredictions.append(ensemblescore)\n", "\n", "print(\"\") \n", "print(\"done with for loop\")\n", "\n" ] }, { "cell_type": "code", "execution_count": 117, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": 
"stream", "text": [ "\n", "0.799149552158\n" ] } ], "source": [ "print(\"\") \n", "#uncomment if you want to see my ensemble predicting basically all zeros.\n", "#print(ensemblepredictions)\n", "\n", "ensemblescore = accuracy_score(testing_labels, ensemblepredictions)\n", "print(ensemblescore)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Confusion Matrix" ] }, { "cell_type": "code", "execution_count": 118, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[17658, 11],\n", " [ 4429, 8]])" ] }, "execution_count": 118, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.metrics import confusion_matrix\n", "\n", "confusion_matrix(testing_labels, ensemblepredictions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Error Analysis:\n", "At first I was confused about why nearly all of the predictions were zeros. What we learned is that having these features doesn't necessarily mean you can turn them into a yes-or-no no-show prediction. 
Rather, the individual classifiers and the ensemble model learned that if you just say people are always going to show up, you will get about 80% accuracy.\n", "\n", "That being said, you can still predict the probability that an individual will or will not show up by using `predict_proba` on the logistic regression model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predicting the probability of an individual patient not showing up" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Patient #1: 5-year-old female with a lot of medical \"problems\"" ] }, { "cell_type": "code", "execution_count": 119, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Enter the patient's age: \n", "5\n", "Is the patient on Scholarship(Yes or No): \n", "Yes\n", "Does the patient have Hypertension?(Yes or No): \n", "Yes\n", "Is the patient Diabetic?(Yes or No): \n", "Yes\n", "Is the patient an Alcoholic?(Yes or No): \n", "Yes\n", "How many Handicaps does the patient have?: \n", "5\n", "Will the patient recieve a text message?(Yes or No): \n", "Yes\n", "What is the patient gender?(Male or Female): \n", "Female\n", "How many days away is the appointment?: \n", "10\n", " \n", " \n", "user answer: \n", "[5, 1.0, 1.0, 1.0, 1.0, 5, 1.0, 1.0, 0.0, 10]\n" ] } ], "source": [ "age = 0.0\n", "scholarship = 0.0\n", "hypertension = 0.0\n", "diabetes = 0.0\n", "alcoholism = 0.0\n", "handicap = 0.0\n", "sms = 0.0\n", "male = 0.0\n", "female = 0.0\n", "waitingtime = 0\n", "\n", "\n", "\n", "inputage = int(input(\"Enter the patient's age: \"+ \"\\n\" ))\n", "inputscholarship = input(\"Is the patient on Scholarship(Yes or No): \"+ \"\\n\" )\n", "inputhypertension = input(\"Does the patient have Hypertension?(Yes or No): \"+ \"\\n\" )\n", "inputdiabetes = input(\"Is the patient Diabetic?(Yes or No): \"+ \"\\n\" )\n", "inputalcoholism = input(\"Is the patient an Alcoholic?(Yes or No): \"+ \"\\n\" )\n", "inputhandicap = int(input(\"How many Handicaps does the patient 
have?: \"+ \"\\n\" ))\n", "inputsms = input(\"Will the patient recieve a text message?(Yes or No): \"+ \"\\n\" )\n", "inputgender = input(\"What is the patient gender?(Male or Female): \"+ \"\\n\")\n", "\n", "inputwaitingtime = int(input(\"How many days away is the appointment?: \"+ \"\\n\"))\n", "\n", "age = inputage\n", "handicap = inputhandicap\n", "waitingtime = inputwaitingtime\n", "\n", "# Normalise each Yes/No answer (strip spaces, ignore case); anything other than yes counts as no\n", "scholarship = 1.0 if inputscholarship.strip().lower() == \"yes\" else 0.0\n", "hypertension = 1.0 if inputhypertension.strip().lower() == \"yes\" else 0.0\n", "diabetes = 1.0 if inputdiabetes.strip().lower() == \"yes\" else 0.0\n", "alcoholism = 1.0 if inputalcoholism.strip().lower() == \"yes\" else 0.0\n", "sms = 1.0 if inputsms.strip().lower() == \"yes\" else 0.0\n", "\n", "# Gender is one-hot encoded into the male and female columns\n", "gender = inputgender.strip().lower()\n", "if gender in (\"male\", \"m\"):\n", "    male = 1.0\n", "elif gender in (\"female\", \"f\"):\n", "    female = 1.0\n", "else:\n", "    print(\"incorrect gender input\")\n", "\n", "\n", "answer = [age,scholarship,hypertension,diabetes,alcoholism,handicap,sms,male,female,waitingtime]\n", "\n", "# The commented-out chunk was for testing, so I didn't have to type everything in every time I wanted to test\n", "\n", "\n", "print(\" \")\n", "\"\"\"\n", "print(\"Fixed answer: \")\n", "\n", "fixedanswer = [100, 1.0, 1.0 ,1.0, 1.0, 5, 0.0, 1.0, 1.0,10]\n", "\n", "print(fixedanswer)\n", "\n", "\n", "#print(rf.predict([fixedanswer]))\n", "#print(lg.predict([fixedanswer]))\n", "#print(knn.predict([fixedanswer]))\n", "\"\"\"\n", "\n", "print(\" \")\n", "\n", "print(\"user answer: \")\n", "\n", "\n", "print(answer)\n", "\n" ] }, { "cell_type": "code", "execution_count": 120, "metadata": 
{}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There is a 0.540198 percent chance patient #1 does show up and a 0.459802 percent chance this person doesn't show up\n" ] } ], "source": [ "\n", "probabilityresult = lg.predict_proba([answer])\n", "\n", "# Columns follow lg.classes_: index 0 is class 0 (show), index 1 is class 1 (no show)\n", "showupproba = probabilityresult[0][0]\n", "noshowproba = probabilityresult[0][1]\n", "\n", "# These values are probabilities between 0 and 1, not percentages\n", "print(\"There is a %f probability that patient #1 does show up and a %f probability that this person doesn't show up\" % (showupproba,noshowproba))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Patient #2: 100-year-old female with the same medical \"problems\"" ] }, { "cell_type": "code", "execution_count": 121, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Enter the patient's age: \n", "100\n", "Is the patient on Scholarship(Yes or No): \n", "Yes\n", "Does the patient have Hypertension?(Yes or No): \n", "Yes\n", "Is the patient Diabetic?(Yes or No): \n", "Yes\n", "Is the patient an Alcoholic?(Yes or No): \n", "Yes\n", "How many Handicaps does the patient have?: \n", "5\n", "Will the patient recieve a text message?(Yes or No): \n", "Yes\n", "What is the patient gender?(Male or Female): \n", "Female\n", "How many days away is the appointment?: \n", "10\n", " \n", " \n", "user answer: \n", "[100, 1.0, 1.0, 1.0, 1.0, 5, 1.0, 1.0, 0.0, 10]\n" ] } ], "source": [ "age = 0.0\n", "scholarship = 0.0\n", "hypertension = 0.0\n", "diabetes = 0.0\n", "alcoholism = 0.0\n", "handicap = 0.0\n", "sms = 0.0\n", "male = 0.0\n", "female = 0.0\n", "waitingtime = 0\n", "\n", "\n", "\n", "inputage = int(input(\"Enter the patient's age: \"+ \"\\n\" ))\n", "inputscholarship = input(\"Is the patient on Scholarship(Yes or No): \"+ \"\\n\" )\n", "inputhypertension = input(\"Does the patient have Hypertension?(Yes or No): \"+ \"\\n\" )\n", "inputdiabetes = input(\"Is the patient 
Diabetic?(Yes or No): \"+ \"\\n\" )\n", "inputalcoholism = input(\"Is the patient an Alcoholic?(Yes or No): \"+ \"\\n\" )\n", "inputhandicap = int(input(\"How many Handicaps does the patient have?: \"+ \"\\n\" ))\n", "inputsms = input(\"Will the patient recieve a text message?(Yes or No): \"+ \"\\n\" )\n", "inputgender = input(\"What is the patient gender?(Male or Female): \"+ \"\\n\")\n", "\n", "inputwaitingtime = int(input(\"How many days away is the appointment?: \"+ \"\\n\"))\n", "\n", "age = inputage\n", "handicap = inputhandicap\n", "waitingtime = inputwaitingtime\n", "\n", "# Normalise each Yes/No answer (strip spaces, ignore case); anything other than yes counts as no\n", "scholarship = 1.0 if inputscholarship.strip().lower() == \"yes\" else 0.0\n", "hypertension = 1.0 if inputhypertension.strip().lower() == \"yes\" else 0.0\n", "diabetes = 1.0 if inputdiabetes.strip().lower() == \"yes\" else 0.0\n", "alcoholism = 1.0 if inputalcoholism.strip().lower() == \"yes\" else 0.0\n", "sms = 1.0 if inputsms.strip().lower() == \"yes\" else 0.0\n", "\n", "# Gender is one-hot encoded into the male and female columns\n", "gender = inputgender.strip().lower()\n", "if gender in (\"male\", \"m\"):\n", "    male = 1.0\n", "elif gender in (\"female\", \"f\"):\n", "    female = 1.0\n", "else:\n", "    print(\"incorrect gender input\")\n", "\n", "\n", "answer = [age,scholarship,hypertension,diabetes,alcoholism,handicap,sms,male,female,waitingtime]\n", "\n", "# The commented-out chunk was for testing, so I didn't have to type everything in every time I wanted to test\n", "\n", "\n", "print(\" \")\n", "\"\"\"\n", "print(\"Fixed answer: \")\n", "\n", "fixedanswer = [100, 1.0, 1.0 ,1.0, 1.0, 5, 0.0, 1.0, 1.0,10]\n", "\n", "print(fixedanswer)\n", "\n", "\n", "#print(rf.predict([fixedanswer]))\n", "#print(lg.predict([fixedanswer]))\n", 
"#print(knn.predict([fixedanswer]))\n", "\"\"\"\n", "\n", "print(\" \")\n", "\n", "print(\"user answer: \")\n", "\n", "\n", "print(answer)\n" ] }, { "cell_type": "code", "execution_count": 122, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There is a 0.707785 percent chance patient #1 does show up and a 0.292215 percent chance this person doesn't show up\n" ] } ], "source": [ "probabilityresult = lg.predict_proba([answer])\n", "\n", "showupproba = probabilityresult[0]\n", "showupproba = showupproba[0]\n", "\n", "\n", "noshowproba = probabilityresult[0]\n", "noshowproba = noshowproba[1]\n", "\n", "print(\"There is a %f percent chance patient #1 does show up and a %f percent chance this person doesn't show up\" % (showupproba,noshowproba))\n", " \n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analysis of different patients\n", "\n", "Age seems to be a big huge factor which makes sense because children aren't able to get to appointments themselves. Also if a child is 5 and is an alcholic i'm pretty sure that would mean they would have an alcholic and unrealiable parents which would make sense that they would have a way less likelyhood to show up than a 100 year old with the same features. \n", "\n", "While a 16% difference in showing up or not may not be the most instiutive or knowledgable thing in the world we can \"empirically\" say these people are more or less likely to show up than one another. \n", "\n", "side note I tried different training testing partitions(10/90 and 70/30) and that didn't really have any affect on my accuracy levels. 
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 2 }