{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "import math\n", "\n", "plt.style.use(\"bmh\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "\n", "## Background\n", "\n", "On weekdays, I commute by either train or TransJakarta bus. Because both are public transport, there is no guarantee that a vehicle has enough room for every passenger (let alone that the ride is *comfortable*). Sometimes I therefore have to wait for the next train or bus and hope that it has more room. The question is: will the next train or bus have enough space for me to get on?\n", "\n", "## Research Questions\n", "\n", "How likely am I to get on a bus after waiting a given number of minutes?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Assumptions and Known Limitations\n", "\n", "- This analysis focuses on the bus rather than the train, on the assumption that once I am waiting at a train station I have no other commuting option. If the bus does not arrive, by contrast, I can switch to a ride-hailing service.\n", "- The survival analysis is based on the Kaplan-Meier estimator." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Preparation\n", "\n", "The dataset contains 1,000 simulated observations of bus waiting time, ranging from 1 to 45 minutes. The `observed` attribute indicates whether or not I got on the bus after waiting for `wait_time` minutes." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from random import randint\n", "import datetime\n", "\n", "n = 1000 # number of observations\n", "start_date = datetime.datetime(2018,1,1,8,0,0)\n", "wait_time = []\n", "end_date = []\n", "\n", "for i in range(n):\n", " date = start_date + datetime.timedelta(minutes=randint(1,45))\n", " wait_time.append((date.hour-8)*60 + date.minute)\n", " end_date.append(date.strftime(format='%H:%M'))" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(\n", " data = {\n", " 'start_time': np.repeat(start_date.strftime(format='%H:%M'), n),\n", " 'end_time': end_date,\n", " 'wait_time': np.array(wait_time).T,\n", " 'observed': np.random.binomial(n=1, p=.5, size=n)\n", " }\n", ")" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset preview:\n" ] }, { "data": { "text/html": [ "
|     | start_time | end_time | wait_time | observed |
|-----|------------|----------|-----------|----------|
| 995 | 08:00 | 08:20 | 20 | 1 |
| 996 | 08:00 | 08:15 | 15 | 1 |
| 997 | 08:00 | 08:42 | 42 | 1 |
| 998 | 08:00 | 08:11 | 11 | 1 |
| 999 | 08:00 | 08:18 | 18 | 0 |