{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "\"GMV\n", "\"UPM\n", "

EDA: EMT dataset 🚦

\n", "
INESDATA-MOV
\n", "
\n", "\n", "\n", "# Exploratory data analysis (EDA)\n", "\n", "This notebook performs the exploratory data analysis (EDA) of the [EMT](https://www.emtmadrid.es/Home) dataset. Once the quality analysis is done, the following is studied for the resulting dataset:\n", "\n", "* Null values and uniqueness of the variables\n", "* Correlations among the variables\n", "* Correlations between the variables and the target variable `estimateArrive`\n", "\n", "The goal is to filter out variables that carry no relevant information for the model, as well as variables that duplicate the information of another one. The null values are also studied so that, where appropriate, the missing data can be reconstructed." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import statistics\n", "import random\n", "import polars as pl\n", "import pandas as pd\n", "import seaborn as sns\n", "import polars.selectors as cs\n", "import matplotlib.pyplot as plt\n", "from datetime import datetime, timedelta\n", "\n", "import warnings\n", "\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "ROOT_PATH = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(os.getcwd()))))\n", "DATA_PATH = os.path.join(ROOT_PATH, \"data\", \"processed\")\n", "EMT_DATA_PATH = os.path.join(DATA_PATH, \"emt\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "sample_data = pl.scan_csv(os.path.join(EMT_DATA_PATH, \"2024\", \"03\",\"13\", \"emt_20240313.csv\"))" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (1_134_462, 19)
date | datetime | bus | line | stop | positionBusLon | positionBusLat | positionTypeBus | DistanceBus | destination | deviation | StartTime | StopTime | MinimunFrequency | MaximumFrequency | isHead | dayType | strike | estimateArrive
str | str | i64 | str | i64 | f64 | f64 | i64 | i64 | str | i64 | str | str | i64 | i64 | bool | str | str | i64
"2024-03-13" | "2024-03-13 08:… | 513 | "27" | 56 | -3.690542 | 40.423739 | 0 | 1841 | "PLAZA CASTILLA… | 0 | "05:55" | "23:30" | 3 | 12 | false | "LA" | "N" | 473
"2024-03-13" | "2024-03-13 08:… | 521 | "27" | 56 | -3.689019 | 40.429011 | 0 | 1221 | "PLAZA CASTILLA… | 0 | "05:55" | "23:30" | 3 | 12 | false | "LA" | "N" | 313
"2024-03-13" | "2024-03-13 08:… | 2549 | "150" | 56 | -3.691608 | 40.421366 | 0 | 2081 | "VIRGEN CORTIJO… | 0 | null | null | null | null | false | null | null | 547
"2024-03-13" | "2024-03-13 08:… | 2561 | "150" | 56 | -3.698411 | 40.418013 | 0 | 2779 | "VIRGEN CORTIJO… | 0 | null | null | null | null | false | null | null | 1080
"2024-03-13" | "2024-03-13 08:… | 5571 | "14" | 56 | -3.687918 | 40.43285 | 0 | 679 | "PIO XII" | 0 | null | null | null | null | false | null | null | 197
"2024-03-13" | "2024-03-13 22:… | 2141 | "174" | 51023 | -3.667048 | 40.489136 | 0 | 8114 | "VALDEBEBAS" | 0 | "06:00" | "23:45" | 7 | 22 | false | "LA" | "N" | 1225
"2024-03-13" | "2024-03-13 22:… | 8810 | "171" | 51023 | 0.0 | 0.0 | 0 | 13290 | "VALDEBEBAS" | 0 | null | null | null | null | false | null | null | 2163
"2024-03-13" | "2024-03-13 22:… | 8829 | "171" | 51023 | 0.0 | 0.0 | 0 | 2088 | "VALDEBEBAS" | 0 | null | null | null | null | false | null | null | 396
"2024-03-13" | "2024-03-13 22:… | 2290 | "174" | 3256 | -3.623092 | 40.48233 | 0 | 13847 | "VALDEBEBAS" | 0 | "06:00" | "23:45" | 7 | 22 | false | "LA" | "N" | 1960
"2024-03-13" | "2024-03-13 22:… | 8879 | "174" | 3256 | -3.676137 | 40.46962 | 0 | 4891 | "VALDEBEBAS" | 0 | "06:00" | "23:45" | 7 | 22 | false | "LA" | "N" | 746
" ], "text/plain": [ "shape: (1_134_462, 19)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ date ┆ datetime ┆ bus ┆ line ┆ … ┆ isHead ┆ dayType ┆ strike ┆ estimateArrive β”‚\n", "β”‚ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- β”‚\n", "β”‚ str ┆ str ┆ i64 ┆ str ┆ ┆ bool ┆ str ┆ str ┆ i64 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════════β•ͺ══════β•ͺ══════β•ͺ═══β•ͺ════════β•ͺ═════════β•ͺ════════β•ͺ════════════════║\n", "β”‚ 2024-03-13 ┆ 2024-03-13 ┆ 513 ┆ 27 ┆ … ┆ false ┆ LA ┆ N ┆ 473 β”‚\n", "β”‚ ┆ 08:00:01.765317 ┆ ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 2024-03-13 ┆ 2024-03-13 ┆ 521 ┆ 27 ┆ … ┆ false ┆ LA ┆ N ┆ 313 β”‚\n", "β”‚ ┆ 08:00:01.765317 ┆ ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 2024-03-13 ┆ 2024-03-13 ┆ 2549 ┆ 150 ┆ … ┆ false ┆ null ┆ null ┆ 547 β”‚\n", "β”‚ ┆ 08:00:01.765317 ┆ ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 2024-03-13 ┆ 2024-03-13 ┆ 2561 ┆ 150 ┆ … ┆ false ┆ null ┆ null ┆ 1080 β”‚\n", "β”‚ ┆ 08:00:01.765317 ┆ ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 2024-03-13 ┆ 2024-03-13 ┆ 5571 ┆ 14 ┆ … ┆ false ┆ null ┆ null ┆ 197 β”‚\n", "β”‚ ┆ 08:00:01.765317 ┆ ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … β”‚\n", "β”‚ 2024-03-13 ┆ 2024-03-13 ┆ 2141 ┆ 174 ┆ … ┆ false ┆ LA ┆ N ┆ 1225 β”‚\n", "β”‚ ┆ 22:59:07.770061 ┆ ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 2024-03-13 ┆ 2024-03-13 ┆ 8810 ┆ 171 ┆ … ┆ false ┆ null ┆ null ┆ 2163 β”‚\n", "β”‚ ┆ 22:59:07.770061 ┆ ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 2024-03-13 ┆ 2024-03-13 ┆ 8829 ┆ 171 ┆ … ┆ false ┆ null ┆ null ┆ 396 β”‚\n", "β”‚ ┆ 22:59:07.770061 ┆ ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 2024-03-13 ┆ 2024-03-13 ┆ 2290 ┆ 174 ┆ … ┆ false ┆ LA ┆ N ┆ 1960 β”‚\n", "β”‚ ┆ 22:59:36.378432 ┆ ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 2024-03-13 ┆ 2024-03-13 ┆ 8879 ┆ 174 ┆ … ┆ false ┆ LA ┆ N ┆ 746 β”‚\n", "β”‚ ┆ 22:59:36.378432 ┆ ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", 
"└────────────┴─────────────────┴──────┴──────┴───┴────────┴─────────┴────────┴────────────────┘" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Numeric variables" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 10)
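bus | stop | positionBusLon | positionBusLat | positionTypeBus | DistanceBus | deviation | MinimunFrequency | MaximumFrequency | estimateArrive
i64 | i64 | f64 | f64 | i64 | i64 | i64 | i64 | i64 | i64
513 | 56 | -3.690542 | 40.423739 | 0 | 1841 | 0 | 3 | 12 | 473
521 | 56 | -3.689019 | 40.429011 | 0 | 1221 | 0 | 3 | 12 | 313
2549 | 56 | -3.691608 | 40.421366 | 0 | 2081 | 0 | null | null | 547
2561 | 56 | -3.698411 | 40.418013 | 0 | 2779 | 0 | null | null | 1080
5571 | 56 | -3.687918 | 40.43285 | 0 | 679 | 0 | null | null | 197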
" ], "text/plain": [ "shape: (5, 10)\n", "┌──────┬──────┬─────────────┬─────────────┬───┬───────────┬─────────────┬─────────────┬────────────┐\n", "│ bus  ┆ stop ┆ positionBus ┆ positionBus ┆ … ┆ deviation ┆ MinimunFreq ┆ MaximumFreq ┆ estimateAr │\n", "│ ---  ┆ ---  ┆ Lon         ┆ Lat         ┆   ┆ ---       ┆ uency       ┆ uency       ┆ rive       │\n", "│ i64  ┆ i64  ┆ ---         ┆ ---         ┆   ┆ i64       ┆ ---         ┆ ---         ┆ ---        │\n", "│      ┆      ┆ f64         ┆ f64         ┆   ┆           ┆ i64         ┆ i64         ┆ i64        │\n", "╞══════╪══════╪═════════════╪═════════════╪═══╪═══════════╪═════════════╪═════════════╪════════════╡\n", "│ 513  ┆ 56   ┆ -3.690542   ┆ 40.423739   ┆ … ┆ 0         ┆ 3           ┆ 12          ┆ 473        │\n", "│ 521  ┆ 56   ┆ -3.689019   ┆ 40.429011   ┆ … ┆ 0         ┆ 3           ┆ 12          ┆ 313        │\n", "│ 2549 ┆ 56   ┆ -3.691608   ┆ 40.421366   ┆ … ┆ 0         ┆ null        ┆ null        ┆ 547        │\n", "│ 2561 ┆ 56   ┆ -3.698411   ┆ 40.418013   ┆ … ┆ 0         ┆ null        ┆ null        ┆ 1080       │\n", "│ 5571 ┆ 56   ┆ -3.687918   ┆ 40.43285    ┆ … ┆ 0         ┆ null        ┆ null        ┆ 197        │\n", "└──────┴──────┴─────────────┴─────────────┴───┴───────────┴─────────────┴─────────────┴────────────┘" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.select(cs.numeric()).head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Categorical variables" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 8)
date | datetime | line | destination | StartTime | StopTime | dayType | strike
str | str | str | str | str | str | str | str
"2024-03-13" | "2024-03-13 08:… | "27" | "PLAZA CASTILLA… | "05:55" | "23:30" | "LA" | "N"
"2024-03-13" | "2024-03-13 08:… | "27" | "PLAZA CASTILLA… | "05:55" | "23:30" | "LA" | "N"
"2024-03-13" | "2024-03-13 08:… | "150" | "VIRGEN CORTIJO… | null | null | null | null
"2024-03-13" | "2024-03-13 08:… | "150" | "VIRGEN CORTIJO… | null | null | null | null
"2024-03-13" | "2024-03-13 08:… | "14" | "PIO XII" | null | null | null | null
" ], "text/plain": [ "shape: (5, 8)\n", "┌────────────┬───────────────────┬──────┬────────────────┬───────────┬──────────┬─────────┬────────┐\n", "│ date       ┆ datetime          ┆ line ┆ destination    ┆ StartTime ┆ StopTime ┆ dayType ┆ strike │\n", "│ ---        ┆ ---               ┆ ---  ┆ ---            ┆ ---       ┆ ---      ┆ ---     ┆ ---    │\n", "│ str        ┆ str               ┆ str  ┆ str            ┆ str       ┆ str      ┆ str     ┆ str    │\n", "╞════════════╪═══════════════════╪══════╪════════════════╪═══════════╪══════════╪═════════╪════════╡\n", "│ 2024-03-13 ┆ 2024-03-13        ┆ 27   ┆ PLAZA CASTILLA ┆ 05:55     ┆ 23:30    ┆ LA      ┆ N      │\n", "│            ┆ 08:00:01.765317   ┆      ┆                ┆           ┆          ┆         ┆        │\n", "│ 2024-03-13 ┆ 2024-03-13        ┆ 27   ┆ PLAZA CASTILLA ┆ 05:55     ┆ 23:30    ┆ LA      ┆ N      │\n", "│            ┆ 08:00:01.765317   ┆      ┆                ┆           ┆          ┆         ┆        │\n", "│ 2024-03-13 ┆ 2024-03-13        ┆ 150  ┆ VIRGEN CORTIJO ┆ null      ┆ null     ┆ null    ┆ null   │\n", "│            ┆ 08:00:01.765317   ┆      ┆                ┆           ┆          ┆         ┆        │\n", "│ 2024-03-13 ┆ 2024-03-13        ┆ 150  ┆ VIRGEN CORTIJO ┆ null      ┆ null     ┆ null    ┆ null   │\n", "│            ┆ 08:00:01.765317   ┆      ┆                ┆           ┆          ┆         ┆        │\n", "│ 2024-03-13 ┆ 2024-03-13        ┆ 14   ┆ PIO XII        ┆ null      ┆ null     ┆ null    ┆ null   │\n", "│            ┆ 08:00:01.765317   ┆      ┆                ┆           ┆          ┆         ┆        │\n", "└────────────┴───────────────────┴──────┴────────────────┴───────────┴──────────┴─────────┴────────┘" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.select(cs.string()).head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preprocessing\n", "### Create the PK" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { 
"text/html": [ "
\n", "shape: (5, 20)
date | datetime | bus | line | stop | positionBusLon | positionBusLat | positionTypeBus | DistanceBus | destination | deviation | StartTime | StopTime | MinimunFrequency | MaximumFrequency | isHead | dayType | strike | estimateArrive | PK
str | str | i64 | str | i64 | f64 | f64 | i64 | i64 | str | i64 | str | str | i64 | i64 | bool | str | str | i64 | str
"2024-03-13" | "2024-03-13 08:… | 513 | "27" | 56 | -3.690542 | 40.423739 | 0 | 1841 | "PLAZA CASTILLA… | 0 | "05:55" | "23:30" | 3 | 12 | false | "LA" | "N" | 473 | "2024-03-13 08:…
"2024-03-13" | "2024-03-13 08:… | 521 | "27" | 56 | -3.689019 | 40.429011 | 0 | 1221 | "PLAZA CASTILLA… | 0 | "05:55" | "23:30" | 3 | 12 | false | "LA" | "N" | 313 | "2024-03-13 08:…
"2024-03-13" | "2024-03-13 08:… | 2549 | "150" | 56 | -3.691608 | 40.421366 | 0 | 2081 | "VIRGEN CORTIJO… | 0 | null | null | null | null | false | null | null | 547 | "2024-03-13 08:…
"2024-03-13" | "2024-03-13 08:… | 2561 | "150" | 56 | -3.698411 | 40.418013 | 0 | 2779 | "VIRGEN CORTIJO… | 0 | null | null | null | null | false | null | null | 1080 | "2024-03-13 08:…
"2024-03-13" | "2024-03-13 08:… | 5571 | "14" | 56 | -3.687918 | 40.43285 | 0 | 679 | "PIO XII" | 0 | null | null | null | null | false | null | null | 197 | "2024-03-13 08:…
" ], "text/plain": [ "shape: (5, 20)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ date ┆ datetime ┆ bus ┆ line ┆ … ┆ dayType ┆ strike ┆ estimateArriv ┆ PK β”‚\n", "β”‚ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ e ┆ --- β”‚\n", "β”‚ str ┆ str ┆ i64 ┆ str ┆ ┆ str ┆ str ┆ --- ┆ str β”‚\n", "β”‚ ┆ ┆ ┆ ┆ ┆ ┆ ┆ i64 ┆ β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ════════════════β•ͺ══════β•ͺ══════β•ͺ═══β•ͺ═════════β•ͺ════════β•ͺ═══════════════β•ͺ═══════════════║\n", "β”‚ 2024-03-13 ┆ 2024-03-13 08: ┆ 513 ┆ 27 ┆ … ┆ LA ┆ N ┆ 473 ┆ 2024-03-13 β”‚\n", "β”‚ ┆ 00:01.765317 ┆ ┆ ┆ ┆ ┆ ┆ ┆ 08:00:01.7653 β”‚\n", "β”‚ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 17_B513_… β”‚\n", "β”‚ 2024-03-13 ┆ 2024-03-13 08: ┆ 521 ┆ 27 ┆ … ┆ LA ┆ N ┆ 313 ┆ 2024-03-13 β”‚\n", "β”‚ ┆ 00:01.765317 ┆ ┆ ┆ ┆ ┆ ┆ ┆ 08:00:01.7653 β”‚\n", "β”‚ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 17_B521_… β”‚\n", "β”‚ 2024-03-13 ┆ 2024-03-13 08: ┆ 2549 ┆ 150 ┆ … ┆ null ┆ null ┆ 547 ┆ 2024-03-13 β”‚\n", "β”‚ ┆ 00:01.765317 ┆ ┆ ┆ ┆ ┆ ┆ ┆ 08:00:01.7653 β”‚\n", "β”‚ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 17_B2549… β”‚\n", "β”‚ 2024-03-13 ┆ 2024-03-13 08: ┆ 2561 ┆ 150 ┆ … ┆ null ┆ null ┆ 1080 ┆ 2024-03-13 β”‚\n", "β”‚ ┆ 00:01.765317 ┆ ┆ ┆ ┆ ┆ ┆ ┆ 08:00:01.7653 β”‚\n", "β”‚ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 17_B2561… β”‚\n", "β”‚ 2024-03-13 ┆ 2024-03-13 08: ┆ 5571 ┆ 14 ┆ … ┆ null ┆ null ┆ 197 ┆ 2024-03-13 β”‚\n", "β”‚ ┆ 00:01.765317 ┆ ┆ ┆ ┆ ┆ ┆ ┆ 08:00:01.7653 β”‚\n", "β”‚ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 17_B5571… β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 7, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "sample_data = sample_data.with_columns((pl.col('datetime').cast(pl.String)+\"_B\"+pl.col('bus').cast(pl.String)+\"_L\"+ pl.col('line').cast(pl.String)+\"_S\"+pl.col('stop').cast(pl.String)).alias('PK'))\n", "sample_data.head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**The PK is not unique**" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 2)
PK | count
str | u32
"2024-03-13 22:… | 2
"2024-03-13 22:… | 2
"2024-03-13 22:… | 2
"2024-03-13 22:… | 2
"2024-03-13 22:… | 2
" ], "text/plain": [ "shape: (5, 2)\n", "┌───────────────────────────────────┬───────┐\n", "│ PK                                ┆ count │\n", "│ ---                               ┆ ---   │\n", "│ str                               ┆ u32   │\n", "╞═══════════════════════════════════╪═══════╡\n", "│ 2024-03-13 22:57:54.826278_B5738… ┆ 2     │\n", "│ 2024-03-13 22:58:02.761728_B2260… ┆ 2     │\n", "│ 2024-03-13 22:39:53.268162_B5738… ┆ 2     │\n", "│ 2024-03-13 22:53:53.748548_B5738… ┆ 2     │\n", "│ 2024-03-13 22:54:57.597581_B5413… ┆ 2     │\n", "└───────────────────────────────────┴───────┘" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('PK')).count().sort('count',descending = True).head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Remove incorrect `estimateArrive` values" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 2)
estimateArrive | count
i64 | u32
999999 | 2558
888888 | 753
5389 | 1
5343 | 1
5339 | 1
" ], "text/plain": [ "shape: (5, 2)\n", "┌────────────────┬───────┐\n", "│ estimateArrive ┆ count │\n", "│ ---            ┆ ---   │\n", "│ i64            ┆ u32   │\n", "╞════════════════╪═══════╡\n", "│ 999999         ┆ 2558  │\n", "│ 888888         ┆ 753   │\n", "│ 5389           ┆ 1     │\n", "│ 5343           ┆ 1     │\n", "│ 5339           ┆ 1     │\n", "└────────────────┴───────┘" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('estimateArrive')).count().sort('estimateArrive',descending=True).head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We keep only the rows with ETA < 888888, since `888888` and `999999` are erroneous time values" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "sample_data = sample_data.filter(pl.col('estimateArrive')<888888)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Check whether the PK is unique**" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 2)
PK | count
str | u32
"2024-03-13 22:… | 2
"2024-03-13 10:… | 2
"2024-03-13 10:… | 2
"2024-03-13 10:… | 2
"2024-03-13 10:… | 2
" ], "text/plain": [ "shape: (5, 2)\n", "┌───────────────────────────────────┬───────┐\n", "│ PK                                ┆ count │\n", "│ ---                               ┆ ---   │\n", "│ str                               ┆ u32   │\n", "╞═══════════════════════════════════╪═══════╡\n", "│ 2024-03-13 22:56:55.785282_B2263… ┆ 2     │\n", "│ 2024-03-13 10:36:31.407876_B2544… ┆ 2     │\n", "│ 2024-03-13 10:37:05.685830_B2544… ┆ 2     │\n", "│ 2024-03-13 10:55:23.770393_B2539… ┆ 2     │\n", "│ 2024-03-13 10:35:57.665293_B2544… ┆ 2     │\n", "└───────────────────────────────────┴───────┘" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('PK')).count().sort('count',descending = True).head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Duplicates remain, so for those cases we keep the lowest ETA. Note that `group_by('PK').min()` takes the minimum of each column independently, so the other fields of a deduplicated row may come from different duplicates." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "sample_data = sample_data.group_by('PK').min()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 2)
PK | count
str | u32
"2024-03-13 15:… | 1
"2024-03-13 17:… | 1
"2024-03-13 17:… | 1
"2024-03-13 12:… | 1
"2024-03-13 14:… | 1
" ], "text/plain": [ "shape: (5, 2)\n", "┌───────────────────────────────────┬───────┐\n", "│ PK                                ┆ count │\n", "│ ---                               ┆ ---   │\n", "│ str                               ┆ u32   │\n", "╞═══════════════════════════════════╪═══════╡\n", "│ 2024-03-13 15:28:07.888841_B4840… ┆ 1     │\n", "│ 2024-03-13 17:50:57.960038_B2260… ┆ 1     │\n", "│ 2024-03-13 17:14:02.830113_B4965… ┆ 1     │\n", "│ 2024-03-13 12:19:58.867713_B8846… ┆ 1     │\n", "│ 2024-03-13 14:27:27.845018_B4723… ┆ 1     │\n", "└───────────────────────────────────┴───────┘" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by('PK').count().sort('count',descending=True).head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Correlation matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Convert `datetime` to a datetime type**" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "sample_data = sample_data.with_columns(pl.col('datetime').map_elements(lambda x: datetime.strptime(x, \"%Y-%m-%d %H:%M:%S.%f\")))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Convert `date` to a date type**" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "sample_data = sample_data.with_columns(pl.col(\"date\").cast(pl.Date))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Convert the `isHead` variable to numeric**" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "sample_data = sample_data.with_columns(pl.col('isHead').cast(pl.UInt8))" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { 
"text/html": [ "
\n", "shape: (5, 20)
PK | date | datetime | bus | line | stop | positionBusLon | positionBusLat | positionTypeBus | DistanceBus | destination | deviation | StartTime | StopTime | MinimunFrequency | MaximumFrequency | isHead | dayType | strike | estimateArrive
str | date | datetime[μs] | i64 | str | i64 | f64 | f64 | i64 | i64 | str | i64 | str | str | i64 | i64 | u8 | str | str | i64
"2024-03-13 13:… | 2024-03-13 | 2024-03-13 13:57:02.921708 | 5575 | "14" | 54 | -3.692133 | 40.420172 | 0 | 2227 | "PIO XII" | 0 | null | null | null | null | 0 | null | null | 844
"2024-03-13 21:… | 2024-03-13 | 2024-03-13 21:53:03.331814 | 5667 | "40" | 52 | -3.697028 | 40.432398 | 0 | 4060 | "ALFONSO XIII" | 0 | null | null | null | null | 0 | null | null | 1406
"2024-03-13 13:… | 2024-03-13 | 2024-03-13 13:41:03.386037 | 4793 | "45" | 72 | -3.694743 | 40.392751 | 0 | 2869 | "REINA VICTORIA… | 0 | null | null | null | null | 0 | null | null | 1085
"2024-03-13 13:… | 2024-03-13 | 2024-03-13 13:56:54.781069 | 521 | "27" | 54 | -3.689548 | 40.43714 | 0 | 612 | "PLAZA CASTILLA… | 0 | "05:55" | "23:30" | 3 | 12 | 0 | "LA" | "N" | 194
"2024-03-13 18:… | 2024-03-13 | 2024-03-13 18:24:05.996058 | 2468 | "134" | 4968 | -3.713937 | 40.483179 | 0 | 1207 | "MONTECARMELO" | 0 | "06:25" | "23:45" | 9 | 26 | 0 | "LA" | "N" | 283
" ], "text/plain": [ "shape: (5, 20)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ PK ┆ date ┆ datetime ┆ bus ┆ … ┆ isHead ┆ dayType ┆ strike ┆ estimateArri β”‚\n", "β”‚ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ ve β”‚\n", "β”‚ str ┆ date ┆ datetime[ΞΌs] ┆ i64 ┆ ┆ u8 ┆ str ┆ str ┆ --- β”‚\n", "β”‚ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ i64 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ════════════β•ͺ═══════════════β•ͺ══════β•ͺ═══β•ͺ════════β•ͺ═════════β•ͺ════════β•ͺ══════════════║\n", "β”‚ 2024-03-13 ┆ 2024-03-13 ┆ 2024-03-13 ┆ 5575 ┆ … ┆ 0 ┆ null ┆ null ┆ 844 β”‚\n", "β”‚ 13:57:02.9217 ┆ ┆ 13:57:02.9217 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 08_B5575… ┆ ┆ 08 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 2024-03-13 ┆ 2024-03-13 ┆ 2024-03-13 ┆ 5667 ┆ … ┆ 0 ┆ null ┆ null ┆ 1406 β”‚\n", "β”‚ 21:53:03.3318 ┆ ┆ 21:53:03.3318 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 14_B5667… ┆ ┆ 14 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 2024-03-13 ┆ 2024-03-13 ┆ 2024-03-13 ┆ 4793 ┆ … ┆ 0 ┆ null ┆ null ┆ 1085 β”‚\n", "β”‚ 13:41:03.3860 ┆ ┆ 13:41:03.3860 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 37_B4793… ┆ ┆ 37 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 2024-03-13 ┆ 2024-03-13 ┆ 2024-03-13 ┆ 521 ┆ … ┆ 0 ┆ LA ┆ N ┆ 194 β”‚\n", "β”‚ 13:56:54.7810 ┆ ┆ 13:56:54.7810 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 69_B521_… ┆ ┆ 69 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 2024-03-13 ┆ 2024-03-13 ┆ 2024-03-13 ┆ 2468 ┆ … ┆ 0 ┆ LA ┆ N ┆ 283 β”‚\n", "β”‚ 18:24:05.9960 ┆ ┆ 18:24:05.9960 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 58_B2468… ┆ ┆ 58 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 17, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.head().collect()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "numeric_var = ['bus', 'line','stop',\n", " 'positionBusLon',\n", " 'positionBusLat',\n", " 'DistanceBus',\n", " 'deviation',\n", " 'MinimunFrequency',\n", " 'MaximumFrequency',\n", " 'isHead',\n", " 'estimateArrive',]" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "sample_data_df = sample_data.collect()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "matrix = [[0] * len(numeric_var) for _ in range(len(numeric_var))]\n", "for i in range(0,len(numeric_var)):\n", " for j in range(0,len(numeric_var)):\n", " matrix[i][j] = sample_data_df.select(pl.corr(numeric_var[i],numeric_var[j])).item()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "matrix_corr = pd.DataFrame(matrix, columns=numeric_var,index=numeric_var).apply(lambda col: col.round(2))" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
variable | bus | line | stop | positionBusLon | positionBusLat | DistanceBus | deviation | MinimunFrequency | MaximumFrequency | isHead | estimateArrive
bus | 1.00 | 0.04 | 0.11 | 0.27 | -0.27 | 0.00 | 0.09 | 0.10 | 0.28 | 0.01 | 0.07
line | 0.04 | 1.00 | 0.24 | -0.08 | 0.09 | 0.27 | 0.01 | 0.43 | 0.39 | 0.08 | 0.12
stop | 0.11 | 0.24 | 1.00 | 0.08 | -0.07 | 0.13 | -0.02 | 0.01 | 0.07 | 0.26 | 0.06
positionBusLon | 0.27 | -0.08 | 0.08 | 1.00 | -1.00 | 0.12 | -0.01 | 0.02 | -0.02 | -0.02 | 0.11
positionBusLat | -0.27 | 0.09 | -0.07 | -1.00 | 1.00 | -0.11 | 0.01 | 0.02 | 0.02 | 0.02 | -0.10
DistanceBus | 0.00 | 0.27 | 0.13 | 0.12 | -0.11 | 1.00 | 0.00 | 0.24 | 0.22 | -0.06 | 0.86
deviation | 0.09 | 0.01 | -0.02 | -0.01 | 0.01 | 0.00 | 1.00 | -0.00 | -0.00 | -0.02 | 0.02
MinimunFrequency | 0.10 | 0.43 | 0.01 | 0.02 | 0.02 | 0.24 | -0.00 | 1.00 | 0.78 | 0.03 | 0.30
MaximumFrequency | 0.28 | 0.39 | 0.07 | -0.02 | 0.02 | 0.22 | -0.00 | 0.78 | 1.00 | 0.04 | 0.26
isHead | 0.01 | 0.08 | 0.26 | -0.02 | 0.02 | -0.06 | -0.02 | 0.03 | 0.04 | 1.00 | 0.02
estimateArrive | 0.07 | 0.12 | 0.06 | 0.11 | -0.10 | 0.86 | 0.02 | 0.30 | 0.26 | 0.02 | 1.00
\n", "
" ], "text/plain": [ " bus line stop positionBusLon positionBusLat \\\n", "bus 1.00 0.04 0.11 0.27 -0.27 \n", "line 0.04 1.00 0.24 -0.08 0.09 \n", "stop 0.11 0.24 1.00 0.08 -0.07 \n", "positionBusLon 0.27 -0.08 0.08 1.00 -1.00 \n", "positionBusLat -0.27 0.09 -0.07 -1.00 1.00 \n", "DistanceBus 0.00 0.27 0.13 0.12 -0.11 \n", "deviation 0.09 0.01 -0.02 -0.01 0.01 \n", "MinimunFrequency 0.10 0.43 0.01 0.02 0.02 \n", "MaximumFrequency 0.28 0.39 0.07 -0.02 0.02 \n", "isHead 0.01 0.08 0.26 -0.02 0.02 \n", "estimateArrive 0.07 0.12 0.06 0.11 -0.10 \n", "\n", " DistanceBus deviation MinimunFrequency MaximumFrequency \\\n", "bus 0.00 0.09 0.10 0.28 \n", "line 0.27 0.01 0.43 0.39 \n", "stop 0.13 -0.02 0.01 0.07 \n", "positionBusLon 0.12 -0.01 0.02 -0.02 \n", "positionBusLat -0.11 0.01 0.02 0.02 \n", "DistanceBus 1.00 0.00 0.24 0.22 \n", "deviation 0.00 1.00 -0.00 -0.00 \n", "MinimunFrequency 0.24 -0.00 1.00 0.78 \n", "MaximumFrequency 0.22 -0.00 0.78 1.00 \n", "isHead -0.06 -0.02 0.03 0.04 \n", "estimateArrive 0.86 0.02 0.30 0.26 \n", "\n", " isHead estimateArrive \n", "bus 0.01 0.07 \n", "line 0.08 0.12 \n", "stop 0.26 0.06 \n", "positionBusLon -0.02 0.11 \n", "positionBusLat 0.02 -0.10 \n", "DistanceBus -0.06 0.86 \n", "deviation -0.02 0.02 \n", "MinimunFrequency 0.03 0.30 \n", "MaximumFrequency 0.04 0.26 \n", "isHead 1.00 0.02 \n", "estimateArrive 0.02 1.00 " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matrix_corr" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.heatmap(matrix_corr, cmap='coolwarm',vmin=-1, vmax=1,annot= True)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "matrix_corr.to_csv('/home/mlia/proyectos/data-generation/docs/notebooks/aux/matrix_corr_emt.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most strongly correlated pairs of variables are:\n", "- `positionBusLon` and `positionBusLat` (r = -1.00)\n", "- `MinimunFrequency` and `MaximumFrequency` (r = 0.78)\n", "\n", "In addition, the variable most correlated with `estimateArrive` is `DistanceBus` (r = 0.86)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Per-variable study\n", "\n", "For each variable we will:\n", "1. Check whether it has null values\n", "2. Compute its correlation with the `estimateArrive` variable\n", "3. Plot the mean arrival time by category, where possible\n", "4. Decide whether to keep it or drop it" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (9, 21)
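statistic | PK | date | datetime | bus | line | stop | positionBusLon | positionBusLat | positionTypeBus | DistanceBus | destination | deviation | StartTime | StopTime | MinimunFrequency | MaximumFrequency | isHead | dayType | strike | estimateArrive
str | str | str | str | f64 | str | f64 | f64 | f64 | f64 | f64 | str | f64 | str | str | f64 | f64 | f64 | str | str | f64
"count" | "5" | "5" | "5" | 5.0 | "5" | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | "5" | 5.0 | "4" | "4" | 4.0 | 4.0 | 5.0 | "4" | "4" | 5.0
"null_count" | "0" | "0" | "0" | 0.0 | "0" | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | "0" | 0.0 | "1" | "1" | 1.0 | 1.0 | 0.0 | "1" | "1" | 0.0
"mean" | null | "2024-03-13" | null | 2826.6 | null | 2255.0 | -3.679584 | 40.464248 | 0.0 | 3542.0 | null | 0.0 | null | null | 5.75 | 20.25 | 0.0 | null | null | 742.2
"std" | null | null | null | 3510.668711 | null | 2304.700631 | 0.015201 | 0.034255 | 0.0 | 2000.683633 | null | 0.0 | null | null | 1.258306 | 2.362908 | 0.0 | null | null | 544.644104
"min" | "2024-03-13 11:… | "2024-03-13" | null | 122.0 | "174" | 78.0 | -3.702037 | 40.405106 | 0.0 | 339.0 | "ALSACIA" | 0.0 | "05:30" | "23:45" | 4.0 | 17.0 | 0.0 | "LA" | "N" | 76.0
"25%" | null | "2024-03-13" | null | 584.0 | null | 246.0 | -3.685782 | 40.464995 | 0.0 | 3195.0 | null | 0.0 | null | null | 6.0 | 20.0 | 0.0 | null | null | 281.0
"50%" | null | "2024-03-13" | null | 2071.0 | null | 1762.0 | -3.678519 | 40.478297 | 0.0 | 3891.0 | null | 0.0 | null | null | 6.0 | 22.0 | 0.0 | null | null | 860.0
"75%" | null | "2024-03-13" | null | 2506.0 | null | 3794.0 | -3.666058 | 40.483716 | 0.0 | 4703.0 | null | 0.0 | null | null | 6.0 | 22.0 | 0.0 | null | null | 1190.0
"max" | "2024-03-13 22:… | "2024-03-13" | null | 8850.0 | "C03" | 5395.0 | -3.665525 | 40.489127 | 0.0 | 5582.0 | "VALDEBEBAS" | 0.0 | "06:00" | "23:45" | 7.0 | 22.0 | 0.0 | "LA" | "N" | 1304.0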
" ], "text/plain": [ "shape: (9, 21)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ statistic ┆ PK ┆ date ┆ datetime ┆ … ┆ isHead ┆ dayType ┆ strike ┆ estimateArri β”‚\n", "β”‚ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ ve β”‚\n", "β”‚ str ┆ str ┆ str ┆ str ┆ ┆ f64 ┆ str ┆ str ┆ --- β”‚\n", "β”‚ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ f64 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════════β•ͺ════════════β•ͺ══════════β•ͺ═══β•ͺ════════β•ͺ═════════β•ͺ════════β•ͺ══════════════║\n", "β”‚ count ┆ 5 ┆ 5 ┆ 5 ┆ … ┆ 5.0 ┆ 4 ┆ 4 ┆ 5.0 β”‚\n", "β”‚ null_count ┆ 0 ┆ 0 ┆ 0 ┆ … ┆ 0.0 ┆ 1 ┆ 1 ┆ 0.0 β”‚\n", "β”‚ mean ┆ null ┆ 2024-03-13 ┆ null ┆ … ┆ 0.0 ┆ null ┆ null ┆ 742.2 β”‚\n", "β”‚ std ┆ null ┆ null ┆ null ┆ … ┆ 0.0 ┆ null ┆ null ┆ 544.644104 β”‚\n", "β”‚ min ┆ 2024-03-13 ┆ 2024-03-13 ┆ null ┆ … ┆ 0.0 ┆ LA ┆ N ┆ 76.0 β”‚\n", "β”‚ ┆ 11:11:06.651 ┆ ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ ┆ 553_B2071… ┆ ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 25% ┆ null ┆ 2024-03-13 ┆ null ┆ … ┆ 0.0 ┆ null ┆ null ┆ 281.0 β”‚\n", "β”‚ 50% ┆ null ┆ 2024-03-13 ┆ null ┆ … ┆ 0.0 ┆ null ┆ null ┆ 860.0 β”‚\n", "β”‚ 75% ┆ null ┆ 2024-03-13 ┆ null ┆ … ┆ 0.0 ┆ null ┆ null ┆ 1190.0 β”‚\n", "β”‚ max ┆ 2024-03-13 ┆ 2024-03-13 ┆ null ┆ … ┆ 0.0 ┆ LA ┆ N ┆ 1304.0 β”‚\n", "β”‚ ┆ 22:04:04.171 ┆ ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ ┆ 556_B2506… ┆ ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.head().describe() " ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "**Variable `date`**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Valores nulos" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (1, 2)
datecount
dateu32
2024-03-131131139
" ], "text/plain": [ "shape: (1, 2)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ date ┆ count β”‚\n", "β”‚ --- ┆ --- β”‚\n", "β”‚ date ┆ u32 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════║\n", "β”‚ 2024-03-13 ┆ 1131139 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('date')).count().head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Variable `datetime`**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Valores nulos" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 2)
datetimecount
datetime[μs]u32
2024-03-13 11:37:57.3212672
2024-03-13 13:26:54.20861212
2024-03-13 09:59:29.1406724
2024-03-13 10:52:58.6649174
2024-03-13 21:01:54.7486382
" ], "text/plain": [ "shape: (5, 2)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ datetime ┆ count β”‚\n", "β”‚ --- ┆ --- β”‚\n", "β”‚ datetime[ΞΌs] ┆ u32 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═══════║\n", "β”‚ 2024-03-13 11:37:57.321267 ┆ 2 β”‚\n", "β”‚ 2024-03-13 13:26:54.208612 ┆ 12 β”‚\n", "β”‚ 2024-03-13 09:59:29.140672 ┆ 4 β”‚\n", "β”‚ 2024-03-13 10:52:58.664917 ┆ 4 β”‚\n", "β”‚ 2024-03-13 21:01:54.748638 ┆ 2 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('datetime')).count().head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Variable `bus`**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Valores nulos" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 2)
buscount
i64u32
33171260
432959
2266515
475817
548229
" ], "text/plain": [ "shape: (5, 2)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ bus ┆ count β”‚\n", "β”‚ --- ┆ --- β”‚\n", "β”‚ i64 ┆ u32 β”‚\n", "β•žβ•β•β•β•β•β•β•ͺ═══════║\n", "β”‚ 3317 ┆ 1260 β”‚\n", "β”‚ 4329 ┆ 59 β”‚\n", "β”‚ 2266 ┆ 515 β”‚\n", "β”‚ 4758 ┆ 17 β”‚\n", "β”‚ 548 ┆ 229 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('bus')).count().head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "CorrelaciΓ³n" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (1, 1)
bus
f64
0.06748
" ], "text/plain": [ "shape: (1, 1)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ bus β”‚\n", "β”‚ --- β”‚\n", "β”‚ f64 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•‘\n", "β”‚ 0.06748 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.select(pl.corr('bus','estimateArrive')).head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tiempo medio de espera segΓΊn autobΓΊs" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 2)
busestimateArrive
i64f64
4755398.504202
25041082.137097
5654380.661765
560442.948276
563528.829897
" ], "text/plain": [ "shape: (5, 2)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ bus ┆ estimateArrive β”‚\n", "β”‚ --- ┆ --- β”‚\n", "β”‚ i64 ┆ f64 β”‚\n", "β•žβ•β•β•β•β•β•β•ͺ════════════════║\n", "β”‚ 4755 ┆ 398.504202 β”‚\n", "β”‚ 2504 ┆ 1082.137097 β”‚\n", "β”‚ 5654 ┆ 380.661765 β”‚\n", "β”‚ 560 ┆ 442.948276 β”‚\n", "β”‚ 563 ┆ 528.829897 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('bus')).mean().select(pl.col('bus'),pl.col('estimateArrive')).head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SegΓΊn que autobΓΊs sea, el tiempo medio de espera varia bastante. Por lo que esta variable va a ser necesaria a la hora de la creaciΓ³n de nuestro modelo." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Variable `line`**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Valores nulos" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 2)
linecount
stru32
"167"3352
"67"46138
"48"13481
"173"23933
"C2"1650
" ], "text/plain": [ "shape: (5, 2)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ line ┆ count β”‚\n", "β”‚ --- ┆ --- β”‚\n", "β”‚ str ┆ u32 β”‚\n", "β•žβ•β•β•β•β•β•β•ͺ═══════║\n", "β”‚ 167 ┆ 3352 β”‚\n", "β”‚ 67 ┆ 46138 β”‚\n", "β”‚ 48 ┆ 13481 β”‚\n", "β”‚ 173 ┆ 23933 β”‚\n", "β”‚ C2 ┆ 1650 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('line')).count().head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "CorrelaciΓ³n" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (1, 1)
line
f64
0.120876
" ], "text/plain": [ "shape: (1, 1)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ line β”‚\n", "β”‚ --- β”‚\n", "β”‚ f64 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•‘\n", "β”‚ 0.120876 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.select(pl.corr('line','estimateArrive')).head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tiempo medio de espera segΓΊn lΓ­nea" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "application/javascript": "(function(root) {\n function now() {\n return new Date();\n }\n\n var force = true;\n var py_version = '3.4.0'.replace('rc', '-rc.').replace('.dev', '-dev.');\n var reloading = false;\n var Bokeh = root.Bokeh;\n\n if (typeof (root._bokeh_timeout) === \"undefined\" || force) {\n root._bokeh_timeout = Date.now() + 5000;\n root._bokeh_failed_load = false;\n }\n\n function run_callbacks() {\n try {\n root._bokeh_onload_callbacks.forEach(function(callback) {\n if (callback != null)\n callback();\n });\n } finally {\n delete root._bokeh_onload_callbacks;\n }\n console.debug(\"Bokeh: all callbacks have finished\");\n }\n\n function load_libs(css_urls, js_urls, js_modules, js_exports, callback) {\n if (css_urls == null) css_urls = [];\n if (js_urls == null) js_urls = [];\n if (js_modules == null) js_modules = [];\n if (js_exports == null) js_exports = {};\n\n root._bokeh_onload_callbacks.push(callback);\n\n if (root._bokeh_is_loading > 0) {\n console.debug(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n return null;\n }\n if (js_urls.length === 0 && js_modules.length === 0 && Object.keys(js_exports).length === 0) {\n run_callbacks();\n return null;\n }\n if (!reloading) {\n console.debug(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n }\n\n function on_load() {\n root._bokeh_is_loading--;\n if (root._bokeh_is_loading 
=== 0) {\n console.debug(\"Bokeh: all BokehJS libraries/stylesheets loaded\");\n run_callbacks()\n }\n }\n window._bokeh_on_load = on_load\n\n function on_error() {\n console.error(\"failed to load \" + url);\n }\n\n var skip = [];\n if (window.requirejs) {\n window.requirejs.config({'packages': {}, 'paths': {}, 'shim': {}});\n root._bokeh_is_loading = css_urls.length + 0;\n } else {\n root._bokeh_is_loading = css_urls.length + js_urls.length + js_modules.length + Object.keys(js_exports).length;\n }\n\n var existing_stylesheets = []\n var links = document.getElementsByTagName('link')\n for (var i = 0; i < links.length; i++) {\n var link = links[i]\n if (link.href != null) {\n\texisting_stylesheets.push(link.href)\n }\n }\n for (var i = 0; i < css_urls.length; i++) {\n var url = css_urls[i];\n if (existing_stylesheets.indexOf(url) !== -1) {\n\ton_load()\n\tcontinue;\n }\n const element = document.createElement(\"link\");\n element.onload = on_load;\n element.onerror = on_error;\n element.rel = \"stylesheet\";\n element.type = \"text/css\";\n element.href = url;\n console.debug(\"Bokeh: injecting link tag for BokehJS stylesheet: \", url);\n document.body.appendChild(element);\n } var existing_scripts = []\n var scripts = document.getElementsByTagName('script')\n for (var i = 0; i < scripts.length; i++) {\n var script = scripts[i]\n if (script.src != null) {\n\texisting_scripts.push(script.src)\n }\n }\n for (var i = 0; i < js_urls.length; i++) {\n var url = js_urls[i];\n if (skip.indexOf(url) !== -1 || existing_scripts.indexOf(url) !== -1) {\n\tif (!window.requirejs) {\n\t on_load();\n\t}\n\tcontinue;\n }\n var element = document.createElement('script');\n element.onload = on_load;\n element.onerror = on_error;\n element.async = false;\n element.src = url;\n console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n document.head.appendChild(element);\n }\n for (var i = 0; i < js_modules.length; i++) {\n var url = js_modules[i];\n if 
(skip.indexOf(url) !== -1 || existing_scripts.indexOf(url) !== -1) {\n\tif (!window.requirejs) {\n\t on_load();\n\t}\n\tcontinue;\n }\n var element = document.createElement('script');\n element.onload = on_load;\n element.onerror = on_error;\n element.async = false;\n element.src = url;\n element.type = \"module\";\n console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n document.head.appendChild(element);\n }\n for (const name in js_exports) {\n var url = js_exports[name];\n if (skip.indexOf(url) >= 0 || root[name] != null) {\n\tif (!window.requirejs) {\n\t on_load();\n\t}\n\tcontinue;\n }\n var element = document.createElement('script');\n element.onerror = on_error;\n element.async = false;\n element.type = \"module\";\n console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n element.textContent = `\n import ${name} from \"${url}\"\n window.${name} = ${name}\n window._bokeh_on_load()\n `\n document.head.appendChild(element);\n }\n if (!js_urls.length && !js_modules.length) {\n on_load()\n }\n };\n\n function inject_raw_css(css) {\n const element = document.createElement(\"style\");\n element.appendChild(document.createTextNode(css));\n document.body.appendChild(element);\n }\n\n var js_urls = [\"https://cdn.bokeh.org/bokeh/release/bokeh-3.4.0.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-gl-3.4.0.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-3.4.0.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-3.4.0.min.js\", \"https://cdn.holoviz.org/panel/1.4.0/dist/panel.min.js\"];\n var js_modules = [];\n var js_exports = {};\n var css_urls = [];\n var inline_js = [ function(Bokeh) {\n Bokeh.set_log_level(\"info\");\n },\nfunction(Bokeh) {} // ensure no trailing comma for IE\n ];\n\n function run_inline_js() {\n if ((root.Bokeh !== undefined) || (force === true)) {\n for (var i = 0; i < inline_js.length; i++) {\n\ttry {\n inline_js[i].call(root, root.Bokeh);\n\t} catch(e) {\n\t if (!reloading) 
{\n\t throw e;\n\t }\n\t}\n }\n // Cache old bokeh versions\n if (Bokeh != undefined && !reloading) {\n\tvar NewBokeh = root.Bokeh;\n\tif (Bokeh.versions === undefined) {\n\t Bokeh.versions = new Map();\n\t}\n\tif (NewBokeh.version !== Bokeh.version) {\n\t Bokeh.versions.set(NewBokeh.version, NewBokeh)\n\t}\n\troot.Bokeh = Bokeh;\n }} else if (Date.now() < root._bokeh_timeout) {\n setTimeout(run_inline_js, 100);\n } else if (!root._bokeh_failed_load) {\n console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n root._bokeh_failed_load = true;\n }\n root._bokeh_is_initializing = false\n }\n\n function load_or_wait() {\n // Implement a backoff loop that tries to ensure we do not load multiple\n // versions of Bokeh and its dependencies at the same time.\n // In recent versions we use the root._bokeh_is_initializing flag\n // to determine whether there is an ongoing attempt to initialize\n // bokeh, however for backward compatibility we also try to ensure\n // that we do not start loading a newer (Panel>=1.0 and Bokeh>3) version\n // before older versions are fully initialized.\n if (root._bokeh_is_initializing && Date.now() > root._bokeh_timeout) {\n root._bokeh_is_initializing = false;\n root._bokeh_onload_callbacks = undefined;\n console.log(\"Bokeh: BokehJS was loaded multiple times but one version failed to initialize.\");\n load_or_wait();\n } else if (root._bokeh_is_initializing || (typeof root._bokeh_is_initializing === \"undefined\" && root._bokeh_onload_callbacks !== undefined)) {\n setTimeout(load_or_wait, 100);\n } else {\n root._bokeh_is_initializing = true\n root._bokeh_onload_callbacks = []\n var bokeh_loaded = Bokeh != null && (Bokeh.version === py_version || (Bokeh.versions !== undefined && Bokeh.versions.has(py_version)));\n if (!reloading && !bokeh_loaded) {\n\troot.Bokeh = undefined;\n }\n load_libs(css_urls, js_urls, js_modules, js_exports, function() {\n\tconsole.debug(\"Bokeh: BokehJS plotting callback run at\", 
now());\n\trun_inline_js();\n });\n }\n }\n // Give older versions of the autoload script a head-start to ensure\n // they initialize before we start loading newer version.\n setTimeout(load_or_wait, 100)\n}(window));", "application/vnd.holoviews_load.v0+json": "" }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": "\nif ((window.PyViz === undefined) || (window.PyViz instanceof HTMLElement)) {\n window.PyViz = {comms: {}, comm_status:{}, kernels:{}, receivers: {}, plot_index: []}\n}\n\n\n function JupyterCommManager() {\n }\n\n JupyterCommManager.prototype.register_target = function(plot_id, comm_id, msg_handler) {\n if (window.comm_manager || ((window.Jupyter !== undefined) && (Jupyter.notebook.kernel != null))) {\n var comm_manager = window.comm_manager || Jupyter.notebook.kernel.comm_manager;\n comm_manager.register_target(comm_id, function(comm) {\n comm.on_msg(msg_handler);\n });\n } else if ((plot_id in window.PyViz.kernels) && (window.PyViz.kernels[plot_id])) {\n window.PyViz.kernels[plot_id].registerCommTarget(comm_id, function(comm) {\n comm.onMsg = msg_handler;\n });\n } else if (typeof google != 'undefined' && google.colab.kernel != null) {\n google.colab.kernel.comms.registerTarget(comm_id, (comm) => {\n var messages = comm.messages[Symbol.asyncIterator]();\n function processIteratorResult(result) {\n var message = result.value;\n console.log(message)\n var content = {data: message.data, comm_id};\n var buffers = []\n for (var buffer of message.buffers || []) {\n buffers.push(new DataView(buffer))\n }\n var metadata = message.metadata || {};\n var msg = {content, buffers, metadata}\n msg_handler(msg);\n return messages.next().then(processIteratorResult);\n }\n return messages.next().then(processIteratorResult);\n })\n }\n }\n\n JupyterCommManager.prototype.get_client_comm = function(plot_id, comm_id, msg_handler) {\n if (comm_id in window.PyViz.comms) {\n return window.PyViz.comms[comm_id];\n } else if 
(window.comm_manager || ((window.Jupyter !== undefined) && (Jupyter.notebook.kernel != null))) {\n var comm_manager = window.comm_manager || Jupyter.notebook.kernel.comm_manager;\n var comm = comm_manager.new_comm(comm_id, {}, {}, {}, comm_id);\n if (msg_handler) {\n comm.on_msg(msg_handler);\n }\n } else if ((plot_id in window.PyViz.kernels) && (window.PyViz.kernels[plot_id])) {\n var comm = window.PyViz.kernels[plot_id].connectToComm(comm_id);\n comm.open();\n if (msg_handler) {\n comm.onMsg = msg_handler;\n }\n } else if (typeof google != 'undefined' && google.colab.kernel != null) {\n var comm_promise = google.colab.kernel.comms.open(comm_id)\n comm_promise.then((comm) => {\n window.PyViz.comms[comm_id] = comm;\n if (msg_handler) {\n var messages = comm.messages[Symbol.asyncIterator]();\n function processIteratorResult(result) {\n var message = result.value;\n var content = {data: message.data};\n var metadata = message.metadata || {comm_id};\n var msg = {content, metadata}\n msg_handler(msg);\n return messages.next().then(processIteratorResult);\n }\n return messages.next().then(processIteratorResult);\n }\n }) \n var sendClosure = (data, metadata, buffers, disposeOnDone) => {\n return comm_promise.then((comm) => {\n comm.send(data, metadata, buffers, disposeOnDone);\n });\n };\n var comm = {\n send: sendClosure\n };\n }\n window.PyViz.comms[comm_id] = comm;\n return comm;\n }\n window.PyViz.comm_manager = new JupyterCommManager();\n \n\n\nvar JS_MIME_TYPE = 'application/javascript';\nvar HTML_MIME_TYPE = 'text/html';\nvar EXEC_MIME_TYPE = 'application/vnd.holoviews_exec.v0+json';\nvar CLASS_NAME = 'output';\n\n/**\n * Render data to the DOM node\n */\nfunction render(props, node) {\n var div = document.createElement(\"div\");\n var script = document.createElement(\"script\");\n node.appendChild(div);\n node.appendChild(script);\n}\n\n/**\n * Handle when a new output is added\n */\nfunction handle_add_output(event, handle) {\n var output_area = 
handle.output_area;\n var output = handle.output;\n if ((output.data == undefined) || (!output.data.hasOwnProperty(EXEC_MIME_TYPE))) {\n return\n }\n var id = output.metadata[EXEC_MIME_TYPE][\"id\"];\n var toinsert = output_area.element.find(\".\" + CLASS_NAME.split(' ')[0]);\n if (id !== undefined) {\n var nchildren = toinsert.length;\n var html_node = toinsert[nchildren-1].children[0];\n html_node.innerHTML = output.data[HTML_MIME_TYPE];\n var scripts = [];\n var nodelist = html_node.querySelectorAll(\"script\");\n for (var i in nodelist) {\n if (nodelist.hasOwnProperty(i)) {\n scripts.push(nodelist[i])\n }\n }\n\n scripts.forEach( function (oldScript) {\n var newScript = document.createElement(\"script\");\n var attrs = [];\n var nodemap = oldScript.attributes;\n for (var j in nodemap) {\n if (nodemap.hasOwnProperty(j)) {\n attrs.push(nodemap[j])\n }\n }\n attrs.forEach(function(attr) { newScript.setAttribute(attr.name, attr.value) });\n newScript.appendChild(document.createTextNode(oldScript.innerHTML));\n oldScript.parentNode.replaceChild(newScript, oldScript);\n });\n if (JS_MIME_TYPE in output.data) {\n toinsert[nchildren-1].children[1].textContent = output.data[JS_MIME_TYPE];\n }\n output_area._hv_plot_id = id;\n if ((window.Bokeh !== undefined) && (id in Bokeh.index)) {\n window.PyViz.plot_index[id] = Bokeh.index[id];\n } else {\n window.PyViz.plot_index[id] = null;\n }\n } else if (output.metadata[EXEC_MIME_TYPE][\"server_id\"] !== undefined) {\n var bk_div = document.createElement(\"div\");\n bk_div.innerHTML = output.data[HTML_MIME_TYPE];\n var script_attrs = bk_div.children[0].attributes;\n for (var i = 0; i < script_attrs.length; i++) {\n toinsert[toinsert.length - 1].childNodes[1].setAttribute(script_attrs[i].name, script_attrs[i].value);\n }\n // store reference to server id on output_area\n output_area._bokeh_server_id = output.metadata[EXEC_MIME_TYPE][\"server_id\"];\n }\n}\n\n/**\n * Handle when an output is cleared or removed\n */\nfunction 
handle_clear_output(event, handle) {\n var id = handle.cell.output_area._hv_plot_id;\n var server_id = handle.cell.output_area._bokeh_server_id;\n if (((id === undefined) || !(id in PyViz.plot_index)) && (server_id !== undefined)) { return; }\n var comm = window.PyViz.comm_manager.get_client_comm(\"hv-extension-comm\", \"hv-extension-comm\", function () {});\n if (server_id !== null) {\n comm.send({event_type: 'server_delete', 'id': server_id});\n return;\n } else if (comm !== null) {\n comm.send({event_type: 'delete', 'id': id});\n }\n delete PyViz.plot_index[id];\n if ((window.Bokeh !== undefined) & (id in window.Bokeh.index)) {\n var doc = window.Bokeh.index[id].model.document\n doc.clear();\n const i = window.Bokeh.documents.indexOf(doc);\n if (i > -1) {\n window.Bokeh.documents.splice(i, 1);\n }\n }\n}\n\n/**\n * Handle kernel restart event\n */\nfunction handle_kernel_cleanup(event, handle) {\n delete PyViz.comms[\"hv-extension-comm\"];\n window.PyViz.plot_index = {}\n}\n\n/**\n * Handle update_display_data messages\n */\nfunction handle_update_output(event, handle) {\n handle_clear_output(event, {cell: {output_area: handle.output_area}})\n handle_add_output(event, handle)\n}\n\nfunction register_renderer(events, OutputArea) {\n function append_mime(data, metadata, element) {\n // create a DOM node to render to\n var toinsert = this.create_output_subarea(\n metadata,\n CLASS_NAME,\n EXEC_MIME_TYPE\n );\n this.keyboard_manager.register_events(toinsert);\n // Render to node\n var props = {data: data, metadata: metadata[EXEC_MIME_TYPE]};\n render(props, toinsert[0]);\n element.append(toinsert);\n return toinsert\n }\n\n events.on('output_added.OutputArea', handle_add_output);\n events.on('output_updated.OutputArea', handle_update_output);\n events.on('clear_output.CodeCell', handle_clear_output);\n events.on('delete.Cell', handle_clear_output);\n events.on('kernel_ready.Kernel', handle_kernel_cleanup);\n\n OutputArea.prototype.register_mime_type(EXEC_MIME_TYPE, 
append_mime, {\n safe: true,\n index: 0\n });\n}\n\nif (window.Jupyter !== undefined) {\n try {\n var events = require('base/js/events');\n var OutputArea = require('notebook/js/outputarea').OutputArea;\n if (OutputArea.prototype.mime_types().indexOf(EXEC_MIME_TYPE) == -1) {\n register_renderer(events, OutputArea);\n }\n } catch(err) {\n }\n}\n", "application/vnd.holoviews_load.v0+json": "" }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.holoviews_exec.v0+json": "", "text/html": [ "
\n", "
\n", "
\n", "" ] }, "metadata": { "application/vnd.holoviews_exec.v0+json": { "id": "p1002" } }, "output_type": "display_data" }, { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.holoviews_exec.v0+json": "", "text/html": [ "
\n", "
\n", "
\n", "" ], "text/plain": [ ":Bars [line] (estimateArrive)" ] }, "execution_count": 33, "metadata": { "application/vnd.holoviews_exec.v0+json": { "id": "p1004" } }, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('line')).mean().select(pl.col('line'),pl.col('estimateArrive')).collect().plot.bar(x='line',rot=90)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The mean waiting time varies considerably from line to line, so this variable will be needed when building our model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Variable `stop`**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Null values" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 2)
stopcount
i64u32
17453446
59193435
58031707
306801
2711721
" ], "text/plain": [ "shape: (5, 2)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ stop ┆ count β”‚\n", "β”‚ --- ┆ --- β”‚\n", "β”‚ i64 ┆ u32 β”‚\n", "β•žβ•β•β•β•β•β•β•ͺ═══════║\n", "β”‚ 1745 ┆ 3446 β”‚\n", "β”‚ 5919 ┆ 3435 β”‚\n", "β”‚ 5803 ┆ 1707 β”‚\n", "β”‚ 30 ┆ 6801 β”‚\n", "β”‚ 271 ┆ 1721 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('stop')).count().head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "CorrelaciΓ³n" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (1, 1)
stop
f64
0.063389
" ], "text/plain": [ "shape: (1, 1)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ stop β”‚\n", "β”‚ --- β”‚\n", "β”‚ f64 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•‘\n", "β”‚ 0.063389 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.select(pl.corr('stop','estimateArrive')).head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tiempo medio de espera segΓΊn parada" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 2)
stopestimateArrive
i64f64
5800804.162099
4493711.560164
1608605.882045
5919773.480058
5803786.485647
" ], "text/plain": [ "shape: (5, 2)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ stop ┆ estimateArrive β”‚\n", "β”‚ --- ┆ --- β”‚\n", "β”‚ i64 ┆ f64 β”‚\n", "β•žβ•β•β•β•β•β•β•ͺ════════════════║\n", "β”‚ 5800 ┆ 804.162099 β”‚\n", "β”‚ 4493 ┆ 711.560164 β”‚\n", "β”‚ 1608 ┆ 605.882045 β”‚\n", "β”‚ 5919 ┆ 773.480058 β”‚\n", "β”‚ 5803 ┆ 786.485647 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('stop')).mean().select(pl.col('stop'),pl.col('estimateArrive')).head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SegΓΊn que parada sea, el tiempo medio de espera varia bastante. Por lo que esta variable va a ser necesaria a la hora de la creaciΓ³n de nuestro modelo." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Variable `positionBusLon`**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Valores nulos" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 2)
positionBusLoncount
f64u32
-3.6979793
-3.6471531
-3.63477
-3.7111682
-3.71340123
" ], "text/plain": [ "shape: (5, 2)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ positionBusLon ┆ count β”‚\n", "β”‚ --- ┆ --- β”‚\n", "β”‚ f64 ┆ u32 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═══════║\n", "β”‚ -3.697979 ┆ 3 β”‚\n", "β”‚ -3.647153 ┆ 1 β”‚\n", "β”‚ -3.6347 ┆ 7 β”‚\n", "β”‚ -3.711168 ┆ 2 β”‚\n", "β”‚ -3.713401 ┆ 23 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('positionBusLon')).count().head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "CorrelaciΓ³n" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (1, 1)
positionBusLon
f64
0.10522
" ], "text/plain": [ "shape: (1, 1)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ positionBusLon β”‚\n", "β”‚ --- β”‚\n", "β”‚ f64 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•‘\n", "β”‚ 0.10522 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.select(pl.corr('positionBusLon','estimateArrive')).head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Variable `positionBusLat`**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Valores nulos" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 2)
positionBusLatcount
f64u32
40.4818689
40.46763917
40.4241762
40.46955124
40.4756469
" ], "text/plain": [ "shape: (5, 2)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ positionBusLat ┆ count β”‚\n", "β”‚ --- ┆ --- β”‚\n", "β”‚ f64 ┆ u32 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═══════║\n", "β”‚ 40.481868 ┆ 9 β”‚\n", "β”‚ 40.467639 ┆ 17 β”‚\n", "β”‚ 40.424176 ┆ 2 β”‚\n", "β”‚ 40.469551 ┆ 24 β”‚\n", "β”‚ 40.475646 ┆ 9 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('positionBusLat')).count().head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "CorrelaciΓ³n" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (1, 1)
positionBusLat
f64
-0.103452
" ], "text/plain": [ "shape: (1, 1)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ positionBusLat β”‚\n", "β”‚ --- β”‚\n", "β”‚ f64 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•‘\n", "β”‚ -0.103452 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.select(pl.corr('positionBusLat','estimateArrive')).head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Variable `positionTypeBus`**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Valores nulos" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (3, 2)
positionTypeBuscount
i64u32
111
51356
01129772
" ], "text/plain": [ "shape: (3, 2)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ positionTypeBus ┆ count β”‚\n", "β”‚ --- ┆ --- β”‚\n", "β”‚ i64 ┆ u32 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════║\n", "β”‚ 1 ┆ 11 β”‚\n", "β”‚ 5 ┆ 1356 β”‚\n", "β”‚ 0 ┆ 1129772 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('positionTypeBus')).count().head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "CorrelaciΓ³n" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (1, 1)
positionTypeBus
f64
0.028544
" ], "text/plain": [ "shape: (1, 1)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ positionTypeBus β”‚\n", "β”‚ --- β”‚\n", "β”‚ f64 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•‘\n", "β”‚ 0.028544 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.select(pl.corr('positionTypeBus','estimateArrive')).head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Como no tenemos informaciΓ³n acerca del significado de esta variable, no podemos entenderla por lo que decidimos eliminarla" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "sample_data = sample_data.drop('positionBusType')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Variable `DistanceBus`**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Valores nulos" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 2)
DistanceBuscount
i64u32
108883
565148
644337
2385248
670284
" ], "text/plain": [ "shape: (5, 2)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ DistanceBus ┆ count β”‚\n", "β”‚ --- ┆ --- β”‚\n", "β”‚ i64 ┆ u32 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═══════║\n", "β”‚ 10888 ┆ 3 β”‚\n", "β”‚ 5651 ┆ 48 β”‚\n", "β”‚ 6443 ┆ 37 β”‚\n", "β”‚ 2385 ┆ 248 β”‚\n", "β”‚ 670 ┆ 284 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('DistanceBus')).count().head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "CorrelaciΓ³n" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (1, 1)
DistanceBus
f64
0.857959
" ], "text/plain": [ "shape: (1, 1)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ DistanceBus β”‚\n", "β”‚ --- β”‚\n", "β”‚ f64 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•‘\n", "β”‚ 0.857959 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.select(pl.corr('DistanceBus','estimateArrive')).head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Como era de esperar, es la variable que mayor correlaciΓ³n tiene con el `ETA`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Variable `destination`**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Valores nulos" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 2)
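<table><thead><tr><th>destination</th><th>count</th></tr><tr><td>str</td><td>u32</td></tr></thead><tbody>
<tr><td>&quot;TELEFONICA&quot;</td><td>5120</td></tr>
<tr><td>&quot;SANCHINARRO&quot;</td><td>37472</td></tr>
<tr><td>&quot;BARRIO DEL PIL…</td><td>54330</td></tr>
<tr><td>&quot;PLAZA CATALUÑA…</td><td>1672</td></tr>
<tr><td>&quot;REINA VICTORIA…</td><td>18282</td></tr>
</tbody></table>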
" ], "text/plain": [ "shape: (5, 2)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ destination ┆ count β”‚\n", "β”‚ --- ┆ --- β”‚\n", "β”‚ str ┆ u32 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═══════║\n", "β”‚ TELEFONICA ┆ 5120 β”‚\n", "β”‚ SANCHINARRO ┆ 37472 β”‚\n", "β”‚ BARRIO DEL PILAR ┆ 54330 β”‚\n", "β”‚ PLAZA CATALUΓ‘A ┆ 1672 β”‚\n", "β”‚ REINA VICTORIA ┆ 18282 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('destination')).count().head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tiempo medio de espera segΓΊn destino" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.holoviews_exec.v0+json": "", "text/html": [ "
\n", "
\n", "
\n", "" ], "text/plain": [ ":Bars [destination] (estimateArrive)" ] }, "execution_count": 47, "metadata": { "application/vnd.holoviews_exec.v0+json": { "id": "p1066" } }, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('destination')).mean().select(pl.col('destination'),pl.col('estimateArrive')).collect().plot.bar(x='destination',rot=90)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The mean waiting time varies considerably depending on the destination, so this variable will be needed when building our model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Variable `deviation`**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Null values" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 2)
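<table><thead><tr><th>deviation</th><th>count</th></tr><tr><td>i64</td><td>u32</td></tr></thead><tbody>
<tr><td>137</td><td>6</td></tr>
<tr><td>0</td><td>1123941</td></tr>
<tr><td>5261</td><td>3092</td></tr>
<tr><td>537</td><td>38</td></tr>
<tr><td>8585</td><td>4044</td></tr>
</tbody></table>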
" ], "text/plain": [ "shape: (5, 2)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ deviation ┆ count β”‚\n", "β”‚ --- ┆ --- β”‚\n", "β”‚ i64 ┆ u32 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════║\n", "β”‚ 137 ┆ 6 β”‚\n", "β”‚ 0 ┆ 1123941 β”‚\n", "β”‚ 5261 ┆ 3092 β”‚\n", "β”‚ 537 ┆ 38 β”‚\n", "β”‚ 8585 ┆ 4044 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('deviation')).count().head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "CorrelaciΓ³n" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (1, 1)
deviation
f64
0.022565
" ], "text/plain": [ "shape: (1, 1)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ deviation β”‚\n", "β”‚ --- β”‚\n", "β”‚ f64 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•β•‘\n", "β”‚ 0.022565 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.select(pl.corr('deviation','estimateArrive')).head().collect()" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (7, 2)
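<table><thead><tr><th>deviation</th><th>estimateArrive</th></tr><tr><td>i64</td><td>f64</td></tr></thead><tbody>
<tr><td>8585</td><td>799.088279</td></tr>
<tr><td>5261</td><td>711.674968</td></tr>
<tr><td>1259</td><td>367.625</td></tr>
<tr><td>537</td><td>463.5</td></tr>
<tr><td>288</td><td>517.5</td></tr>
<tr><td>137</td><td>512.833333</td></tr>
<tr><td>0</td><td>637.801457</td></tr>
</tbody></table>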
" ], "text/plain": [ "shape: (7, 2)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ deviation ┆ estimateArrive β”‚\n", "β”‚ --- ┆ --- β”‚\n", "β”‚ i64 ┆ f64 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•β•ͺ════════════════║\n", "β”‚ 8585 ┆ 799.088279 β”‚\n", "β”‚ 5261 ┆ 711.674968 β”‚\n", "β”‚ 1259 ┆ 367.625 β”‚\n", "β”‚ 537 ┆ 463.5 β”‚\n", "β”‚ 288 ┆ 517.5 β”‚\n", "β”‚ 137 ┆ 512.833333 β”‚\n", "β”‚ 0 ┆ 637.801457 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('deviation')).mean().select(pl.col('deviation'),pl.col('estimateArrive')).sort('deviation', descending=True).collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No tenemos informaciΓ³n acerca del significado de esta variable. Por lo que la eliminamos tambiΓ©n" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "sample_data = sample_data.drop('deviation')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Variable `StartTime`**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Valores nulos" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (9, 2)
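<table><thead><tr><th>StartTime</th><th>count</th></tr><tr><td>str</td><td>u32</td></tr></thead><tbody>
<tr><td>null</td><td>533127</td></tr>
<tr><td>&quot;06:30&quot;</td><td>13514</td></tr>
<tr><td>&quot;06:25&quot;</td><td>54669</td></tr>
<tr><td>&quot;06:20&quot;</td><td>23954</td></tr>
<tr><td>&quot;06:15&quot;</td><td>41060</td></tr>
<tr><td>&quot;06:10&quot;</td><td>61758</td></tr>
<tr><td>&quot;06:00&quot;</td><td>306036</td></tr>
<tr><td>&quot;05:55&quot;</td><td>47464</td></tr>
<tr><td>&quot;05:30&quot;</td><td>49557</td></tr>
</tbody></table>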
" ], "text/plain": [ "shape: (9, 2)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ StartTime ┆ count β”‚\n", "β”‚ --- ┆ --- β”‚\n", "β”‚ str ┆ u32 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•β•ͺ════════║\n", "β”‚ null ┆ 533127 β”‚\n", "β”‚ 06:30 ┆ 13514 β”‚\n", "β”‚ 06:25 ┆ 54669 β”‚\n", "β”‚ 06:20 ┆ 23954 β”‚\n", "β”‚ 06:15 ┆ 41060 β”‚\n", "β”‚ 06:10 ┆ 61758 β”‚\n", "β”‚ 06:00 ┆ 306036 β”‚\n", "β”‚ 05:55 ┆ 47464 β”‚\n", "β”‚ 05:30 ┆ 49557 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('StartTime')).count().sort(pl.col('StartTime'),descending=True).collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Variable `StopTime`**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Valores nulos" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (3, 2)
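<table><thead><tr><th>StopTime</th><th>count</th></tr><tr><td>str</td><td>u32</td></tr></thead><tbody>
<tr><td>null</td><td>533127</td></tr>
<tr><td>&quot;23:45&quot;</td><td>526594</td></tr>
<tr><td>&quot;23:30&quot;</td><td>71418</td></tr>
</tbody></table>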
" ], "text/plain": [ "shape: (3, 2)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ StopTime ┆ count β”‚\n", "β”‚ --- ┆ --- β”‚\n", "β”‚ str ┆ u32 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•ͺ════════║\n", "β”‚ null ┆ 533127 β”‚\n", "β”‚ 23:45 ┆ 526594 β”‚\n", "β”‚ 23:30 ┆ 71418 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('StopTime')).count().sort(pl.col('StopTime'),descending=True).collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "La informaciΓ³n que nos dan las dos variables anteriores refleja que estamos considerando solo autobuses diurnos e ignorando los nocturnos. Por tanto, esta variable no va a inferir en el tiempo de estimaciΓ³n ya que el hecho de que su horario comience a las 6 de la maΓ±ana o a las 7 de la maΓ±ana no va a depender de que tarde mΓ‘s o menos a lo largo del dΓ­a. Por tanto consideramos que se pueden borrar tambiΓ©n." ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "sample_data = sample_data.drop('StartTime','StopTime')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Variable `MinimunFrequency`**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Valores nulos" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 2)
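<table><thead><tr><th>MinimunFrequency</th><th>count</th></tr><tr><td>i64</td><td>u32</td></tr></thead><tbody>
<tr><td>null</td><td>533127</td></tr>
<tr><td>7</td><td>116427</td></tr>
<tr><td>3</td><td>47464</td></tr>
<tr><td>6</td><td>71872</td></tr>
<tr><td>12</td><td>40808</td></tr>
</tbody></table>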
" ], "text/plain": [ "shape: (5, 2)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ MinimunFrequency ┆ count β”‚\n", "β”‚ --- ┆ --- β”‚\n", "β”‚ i64 ┆ u32 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ════════║\n", "β”‚ null ┆ 533127 β”‚\n", "β”‚ 7 ┆ 116427 β”‚\n", "β”‚ 3 ┆ 47464 β”‚\n", "β”‚ 6 ┆ 71872 β”‚\n", "β”‚ 12 ┆ 40808 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('MinimunFrequency')).count().head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "CorrelaciΓ³n" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (1, 1)
MinimunFrequency
f64
0.302355
" ], "text/plain": [ "shape: (1, 1)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ MinimunFrequency β”‚\n", "β”‚ --- β”‚\n", "β”‚ f64 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•‘\n", "β”‚ 0.302355 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.select(pl.corr('MinimunFrequency','estimateArrive')).head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Variable `MaximumFrequency`**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Valores nulos" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (11, 2)
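<table><thead><tr><th>MaximumFrequency</th><th>count</th></tr><tr><td>i64</td><td>u32</td></tr></thead><tbody>
<tr><td>null</td><td>533127</td></tr>
<tr><td>30</td><td>46138</td></tr>
<tr><td>29</td><td>13514</td></tr>
<tr><td>26</td><td>54669</td></tr>
<tr><td>24</td><td>102448</td></tr>
<tr><td>&hellip;</td><td>&hellip;</td></tr>
<tr><td>21</td><td>23933</td></tr>
<tr><td>20</td><td>37704</td></tr>
<tr><td>17</td><td>49557</td></tr>
<tr><td>15</td><td>61758</td></tr>
<tr><td>12</td><td>47464</td></tr>
</tbody></table>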
" ], "text/plain": [ "shape: (11, 2)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ MaximumFrequency ┆ count β”‚\n", "β”‚ --- ┆ --- β”‚\n", "β”‚ i64 ┆ u32 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ════════║\n", "β”‚ null ┆ 533127 β”‚\n", "β”‚ 30 ┆ 46138 β”‚\n", "β”‚ 29 ┆ 13514 β”‚\n", "β”‚ 26 ┆ 54669 β”‚\n", "β”‚ 24 ┆ 102448 β”‚\n", "β”‚ … ┆ … β”‚\n", "β”‚ 21 ┆ 23933 β”‚\n", "β”‚ 20 ┆ 37704 β”‚\n", "β”‚ 17 ┆ 49557 β”‚\n", "β”‚ 15 ┆ 61758 β”‚\n", "β”‚ 12 ┆ 47464 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('MaximumFrequency')).count().sort(pl.col('MaximumFrequency'),descending=True).collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "CorrelaciΓ³n" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (1, 1)
MaximumFrequency
f64
0.260828
" ], "text/plain": [ "shape: (1, 1)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ MaximumFrequency β”‚\n", "β”‚ --- β”‚\n", "β”‚ f64 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•‘\n", "β”‚ 0.260828 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.select(pl.corr('MaximumFrequency','estimateArrive')).head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Las variables `MinimunFrequency ` y `MaximumFrequency ` tienen muchos valores nulos pero si que estΓ‘n relacionadas con el `ETA`. Una opciΓ³n es mantener solo una de ellas ya que es probable que aporten la misma informaciΓ³n. De momento mantenemos las dos." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Variable `isHead`**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Valores nulos" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (2, 2)
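<table><thead><tr><th>isHead</th><th>count</th></tr><tr><td>u8</td><td>u32</td></tr></thead><tbody>
<tr><td>0</td><td>1068067</td></tr>
<tr><td>1</td><td>63072</td></tr>
</tbody></table>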
" ], "text/plain": [ "shape: (2, 2)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ isHead ┆ count β”‚\n", "β”‚ --- ┆ --- β”‚\n", "β”‚ u8 ┆ u32 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•ͺ═════════║\n", "β”‚ 0 ┆ 1068067 β”‚\n", "β”‚ 1 ┆ 63072 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('isHead')).count().head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "CorrelaciΓ³n" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (1, 1)
isHead
f64
0.018379
" ], "text/plain": [ "shape: (1, 1)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ isHead β”‚\n", "β”‚ --- β”‚\n", "β”‚ f64 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•‘\n", "β”‚ 0.018379 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.select(pl.corr('isHead','estimateArrive')).head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tiempo medio de espera segΓΊn si el autobΓΊs es cabecera o no" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.holoviews_exec.v0+json": "", "text/html": [ "
\n", "
\n", "
\n", "" ], "text/plain": [ ":Bars [isHead] (estimateArrive)" ] }, "execution_count": 61, "metadata": { "application/vnd.holoviews_exec.v0+json": { "id": "p1128" } }, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('isHead')).mean().select(pl.col('isHead'),pl.col('estimateArrive')).head().collect().plot.bar(x='isHead')" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (2, 2)
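<table><thead><tr><th>isHead</th><th>estimateArrive</th></tr><tr><td>u8</td><td>f64</td></tr></thead><tbody>
<tr><td>1</td><td>673.029807</td></tr>
<tr><td>0</td><td>636.534496</td></tr>
</tbody></table>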
" ], "text/plain": [ "shape: (2, 2)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ isHead ┆ estimateArrive β”‚\n", "β”‚ --- ┆ --- β”‚\n", "β”‚ u8 ┆ f64 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•ͺ════════════════║\n", "β”‚ 1 ┆ 673.029807 β”‚\n", "β”‚ 0 ┆ 636.534496 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('isHead')).mean().select(pl.col('isHead'),pl.col('estimateArrive')).head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Mantenemos esta variable ya que estΓ‘ bastante relacionada con el `ETA`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Variable `dayType`**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Valores nulos" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (2, 2)
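<table><thead><tr><th>dayType</th><th>count</th></tr><tr><td>str</td><td>u32</td></tr></thead><tbody>
<tr><td>&quot;LA&quot;</td><td>598012</td></tr>
<tr><td>null</td><td>533127</td></tr>
</tbody></table>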
" ], "text/plain": [ "shape: (2, 2)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ dayType ┆ count β”‚\n", "β”‚ --- ┆ --- β”‚\n", "β”‚ str ┆ u32 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•ͺ════════║\n", "β”‚ LA ┆ 598012 β”‚\n", "β”‚ null ┆ 533127 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('dayType')).count().head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Vemos de que tipo son los dΓ­as nulos utilizando la fecha**" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "def get_type_day(date):\n", " \n", " day = date.strftime(\"%A\")\n", " \n", " if day in ['Monday','Tuesday','Wednesday','Thursday','Friday']:\n", " \n", " type = 'LA'\n", " elif day == 'Saturday':\n", " type = 'SA'\n", " else:\n", " type = 'FE'\n", " \n", " return type" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "sample_data = sample_data.with_columns(pl.when(pl.col('dayType').is_null()).then(pl.col('date').apply(get_type_day)).otherwise(pl.col('dayType')).alias('dayType'))" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (1, 2)
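<table><thead><tr><th>dayType</th><th>count</th></tr><tr><td>str</td><td>u32</td></tr></thead><tbody>
<tr><td>&quot;LA&quot;</td><td>1131139</td></tr>
</tbody></table>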
" ], "text/plain": [ "shape: (1, 2)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ dayType ┆ count β”‚\n", "β”‚ --- ┆ --- β”‚\n", "β”‚ str ┆ u32 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•ͺ═════════║\n", "β”‚ LA ┆ 1131139 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('dayType')).count().head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Variable `strike`**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Valores nulos" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (2, 2)
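<table><thead><tr><th>strike</th><th>count</th></tr><tr><td>str</td><td>u32</td></tr></thead><tbody>
<tr><td>null</td><td>533127</td></tr>
<tr><td>&quot;N&quot;</td><td>598012</td></tr>
</tbody></table>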
" ], "text/plain": [ "shape: (2, 2)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ strike ┆ count β”‚\n", "β”‚ --- ┆ --- β”‚\n", "β”‚ str ┆ u32 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•ͺ════════║\n", "β”‚ null ┆ 533127 β”‚\n", "β”‚ N ┆ 598012 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('strike')).count().head().collect()" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (2, 2)
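<table><thead><tr><th>strike</th><th>estimateArrive</th></tr><tr><td>str</td><td>f64</td></tr></thead><tbody>
<tr><td>null</td><td>687.680016</td></tr>
<tr><td>&quot;N&quot;</td><td>594.787466</td></tr>
</tbody></table>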
" ], "text/plain": [ "shape: (2, 2)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ strike ┆ estimateArrive β”‚\n", "β”‚ --- ┆ --- β”‚\n", "β”‚ str ┆ f64 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•ͺ════════════════║\n", "β”‚ null ┆ 687.680016 β”‚\n", "β”‚ N ┆ 594.787466 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('strike')).mean().select(pl.col('strike'),pl.col('estimateArrive')).head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- La variable `Strike` toma solo 'N' o nulo, por lo que para ningΓΊn dΓ­a se tiene constancia de que hubo huelga. Por tanto para estos datos esta variable no va a aportar informaciΓ³n." ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 16)
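<table><thead><tr><th>PK</th><th>date</th><th>datetime</th><th>bus</th><th>line</th><th>stop</th><th>positionBusLon</th><th>positionBusLat</th><th>positionTypeBus</th><th>DistanceBus</th><th>destination</th><th>MinimunFrequency</th><th>MaximumFrequency</th><th>isHead</th><th>dayType</th><th>estimateArrive</th></tr>
<tr><td>str</td><td>date</td><td>datetime[μs]</td><td>i64</td><td>str</td><td>i64</td><td>f64</td><td>f64</td><td>i64</td><td>i64</td><td>str</td><td>i64</td><td>i64</td><td>u8</td><td>str</td><td>i64</td></tr></thead><tbody>
<tr><td>&quot;2024-03-13 12:…</td><td>2024-03-13</td><td>2024-03-13 12:14:06.145791</td><td>5552</td><td>&quot;42&quot;</td><td>5018</td><td>-3.689362</td><td>40.467008</td><td>0</td><td>880</td><td>&quot;BARRIO PEÑAGRA…</td><td>7</td><td>24</td><td>0</td><td>&quot;LA&quot;</td><td>232</td></tr>
<tr><td>&quot;2024-03-13 12:…</td><td>2024-03-13</td><td>2024-03-13 12:58:02.776095</td><td>4757</td><td>&quot;19&quot;</td><td>89</td><td>-3.694772</td><td>40.392075</td><td>0</td><td>1550</td><td>&quot;PLAZA CATALUÑA…</td><td>null</td><td>null</td><td>0</td><td>&quot;LA&quot;</td><td>434</td></tr>
<tr><td>&quot;2024-03-13 10:…</td><td>2024-03-13</td><td>2024-03-13 10:00:02.709359</td><td>4737</td><td>&quot;49&quot;</td><td>1542</td><td>-3.689145</td><td>40.467685</td><td>0</td><td>2618</td><td>&quot;PITIS&quot;</td><td>4</td><td>15</td><td>0</td><td>&quot;LA&quot;</td><td>707</td></tr>
<tr><td>&quot;2024-03-13 15:…</td><td>2024-03-13</td><td>2024-03-13 15:07:05.141116</td><td>2132</td><td>&quot;107&quot;</td><td>1841</td><td>-3.6667</td><td>40.469799</td><td>0</td><td>1855</td><td>&quot;HORTALEZA&quot;</td><td>11</td><td>22</td><td>0</td><td>&quot;LA&quot;</td><td>363</td></tr>
<tr><td>&quot;2024-03-13 15:…</td><td>2024-03-13</td><td>2024-03-13 15:08:07.085498</td><td>4725</td><td>&quot;49&quot;</td><td>5636</td><td>-3.700473</td><td>40.468273</td><td>0</td><td>137</td><td>&quot;PITIS&quot;</td><td>4</td><td>15</td><td>0</td><td>&quot;LA&quot;</td><td>39</td></tr>
</tbody></table>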
" ], "text/plain": [ "shape: (5, 16)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ PK ┆ date ┆ datetime ┆ bus ┆ … ┆ MaximumFreq ┆ isHead ┆ dayType ┆ estimateArr β”‚\n", "β”‚ --- ┆ --- ┆ --- ┆ --- ┆ ┆ uency ┆ --- ┆ --- ┆ ive β”‚\n", "β”‚ str ┆ date ┆ datetime[ΞΌs ┆ i64 ┆ ┆ --- ┆ u8 ┆ str ┆ --- β”‚\n", "β”‚ ┆ ┆ ] ┆ ┆ ┆ i64 ┆ ┆ ┆ i64 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ════════════β•ͺ═════════════β•ͺ══════β•ͺ═══β•ͺ═════════════β•ͺ════════β•ͺ═════════β•ͺ═════════════║\n", "β”‚ 2024-03-13 ┆ 2024-03-13 ┆ 2024-03-13 ┆ 5552 ┆ … ┆ 24 ┆ 0 ┆ LA ┆ 232 β”‚\n", "β”‚ 12:14:06.14 ┆ ┆ 12:14:06.14 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 5791_B5552… ┆ ┆ 5791 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 2024-03-13 ┆ 2024-03-13 ┆ 2024-03-13 ┆ 4757 ┆ … ┆ null ┆ 0 ┆ LA ┆ 434 β”‚\n", "β”‚ 12:58:02.77 ┆ ┆ 12:58:02.77 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 6095_B4757… ┆ ┆ 6095 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 2024-03-13 ┆ 2024-03-13 ┆ 2024-03-13 ┆ 4737 ┆ … ┆ 15 ┆ 0 ┆ LA ┆ 707 β”‚\n", "β”‚ 10:00:02.70 ┆ ┆ 10:00:02.70 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 9359_B4737… ┆ ┆ 9359 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 2024-03-13 ┆ 2024-03-13 ┆ 2024-03-13 ┆ 2132 ┆ … ┆ 22 ┆ 0 ┆ LA ┆ 363 β”‚\n", "β”‚ 15:07:05.14 ┆ ┆ 15:07:05.14 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 1116_B2132… ┆ ┆ 1116 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 2024-03-13 ┆ 2024-03-13 ┆ 2024-03-13 ┆ 4725 ┆ … ┆ 15 ┆ 0 ┆ LA ┆ 39 β”‚\n", "β”‚ 15:08:07.08 ┆ ┆ 15:08:07.08 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 5498_B4725… ┆ ┆ 5498 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 69, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data = sample_data.drop('strike')\n", "sample_data.head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Variable `estimateArrive`**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Valores nulos" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 2)
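<table><thead><tr><th>estimateArrive</th><th>count</th></tr><tr><td>i64</td><td>u32</td></tr></thead><tbody>
<tr><td>1727</td><td>78</td></tr>
<tr><td>828</td><td>750</td></tr>
<tr><td>2236</td><td>13</td></tr>
<tr><td>167</td><td>959</td></tr>
<tr><td>423</td><td>947</td></tr>
</tbody></table>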
" ], "text/plain": [ "shape: (5, 2)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ estimateArrive ┆ count β”‚\n", "β”‚ --- ┆ --- β”‚\n", "β”‚ i64 ┆ u32 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═══════║\n", "β”‚ 1727 ┆ 78 β”‚\n", "β”‚ 828 ┆ 750 β”‚\n", "β”‚ 2236 ┆ 13 β”‚\n", "β”‚ 167 ┆ 959 β”‚\n", "β”‚ 423 ┆ 947 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.group_by(pl.col('estimateArrive')).count().head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Resumen**\n" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (9, 2)
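<table><thead><tr><th>statistic</th><th>estimateArrive</th></tr><tr><td>str</td><td>f64</td></tr></thead><tbody>
<tr><td>&quot;count&quot;</td><td>1.131139e6</td></tr>
<tr><td>&quot;null_count&quot;</td><td>0.0</td></tr>
<tr><td>&quot;mean&quot;</td><td>638.569465</td></tr>
<tr><td>&quot;std&quot;</td><td>455.633197</td></tr>
<tr><td>&quot;min&quot;</td><td>0.0</td></tr>
<tr><td>&quot;25%&quot;</td><td>274.0</td></tr>
<tr><td>&quot;50%&quot;</td><td>566.0</td></tr>
<tr><td>&quot;75%&quot;</td><td>922.0</td></tr>
<tr><td>&quot;max&quot;</td><td>5389.0</td></tr>
</tbody></table>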
" ], "text/plain": [ "shape: (9, 2)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ statistic ┆ estimateArrive β”‚\n", "β”‚ --- ┆ --- β”‚\n", "β”‚ str ┆ f64 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ════════════════║\n", "β”‚ count ┆ 1.131139e6 β”‚\n", "β”‚ null_count ┆ 0.0 β”‚\n", "β”‚ mean ┆ 638.569465 β”‚\n", "β”‚ std ┆ 455.633197 β”‚\n", "β”‚ min ┆ 0.0 β”‚\n", "β”‚ 25% ┆ 274.0 β”‚\n", "β”‚ 50% ┆ 566.0 β”‚\n", "β”‚ 75% ┆ 922.0 β”‚\n", "β”‚ max ┆ 5389.0 β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data.select(pl.col('estimateArrive')).describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "- Las variables `MinimunFrequency ` y `MaximumFrequency ` estΓ‘n muy relacionadas entre ellas. Se podrΓ­a dejar tan solo `MinimumFrequency` ya que aporta mas informacion al `ETA` que la otra\n", "\n", "\n", " " ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 15)
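<table><thead><tr><th>PK</th><th>date</th><th>datetime</th><th>bus</th><th>line</th><th>stop</th><th>positionBusLon</th><th>positionBusLat</th><th>positionTypeBus</th><th>DistanceBus</th><th>destination</th><th>MinimunFrequency</th><th>isHead</th><th>dayType</th><th>estimateArrive</th></tr>
<tr><td>str</td><td>date</td><td>datetime[μs]</td><td>i64</td><td>str</td><td>i64</td><td>f64</td><td>f64</td><td>i64</td><td>i64</td><td>str</td><td>i64</td><td>u8</td><td>str</td><td>i64</td></tr></thead><tbody>
<tr><td>&quot;2024-03-13 09:…</td><td>2024-03-13</td><td>2024-03-13 09:51:55.581672</td><td>4860</td><td>&quot;49&quot;</td><td>1550</td><td>-3.71357</td><td>40.483189</td><td>0</td><td>642</td><td>&quot;PITIS&quot;</td><td>4</td><td>0</td><td>&quot;LA&quot;</td><td>230</td></tr>
<tr><td>&quot;2024-03-13 16:…</td><td>2024-03-13</td><td>2024-03-13 16:53:04.751456</td><td>2468</td><td>&quot;134&quot;</td><td>2864</td><td>-3.708434</td><td>40.498855</td><td>0</td><td>575</td><td>&quot;MONTECARMELO&quot;</td><td>9</td><td>0</td><td>&quot;LA&quot;</td><td>209</td></tr>
<tr><td>&quot;2024-03-13 16:…</td><td>2024-03-13</td><td>2024-03-13 16:14:04.487383</td><td>2310</td><td>&quot;178&quot;</td><td>1760</td><td>-3.685786</td><td>40.47828</td><td>0</td><td>3282</td><td>&quot;MONTECARMELO&quot;</td><td>6</td><td>0</td><td>&quot;LA&quot;</td><td>415</td></tr>
<tr><td>&quot;2024-03-13 16:…</td><td>2024-03-13</td><td>2024-03-13 16:56:06.220714</td><td>9120</td><td>&quot;177&quot;</td><td>5800</td><td>-3.688153</td><td>40.467592</td><td>0</td><td>910</td><td>&quot;MARQUES DE VIA…</td><td>9</td><td>0</td><td>&quot;LA&quot;</td><td>607</td></tr>
<tr><td>&quot;2024-03-13 08:…</td><td>2024-03-13</td><td>2024-03-13 08:29:03.023408</td><td>5637</td><td>&quot;147&quot;</td><td>29</td><td>-3.690032</td><td>40.455447</td><td>0</td><td>972</td><td>&quot;BARRIO DEL PIL…</td><td>null</td><td>0</td><td>&quot;LA&quot;</td><td>170</td></tr>
</tbody></table>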
" ], "text/plain": [ "shape: (5, 15)\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ PK ┆ date ┆ datetime ┆ bus ┆ … ┆ MinimunFreq ┆ isHead ┆ dayType ┆ estimateArr β”‚\n", "β”‚ --- ┆ --- ┆ --- ┆ --- ┆ ┆ uency ┆ --- ┆ --- ┆ ive β”‚\n", "β”‚ str ┆ date ┆ datetime[ΞΌs ┆ i64 ┆ ┆ --- ┆ u8 ┆ str ┆ --- β”‚\n", "β”‚ ┆ ┆ ] ┆ ┆ ┆ i64 ┆ ┆ ┆ i64 β”‚\n", "β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ════════════β•ͺ═════════════β•ͺ══════β•ͺ═══β•ͺ═════════════β•ͺ════════β•ͺ═════════β•ͺ═════════════║\n", "β”‚ 2024-03-13 ┆ 2024-03-13 ┆ 2024-03-13 ┆ 4860 ┆ … ┆ 4 ┆ 0 ┆ LA ┆ 230 β”‚\n", "β”‚ 09:51:55.58 ┆ ┆ 09:51:55.58 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 1672_B4860… ┆ ┆ 1672 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 2024-03-13 ┆ 2024-03-13 ┆ 2024-03-13 ┆ 2468 ┆ … ┆ 9 ┆ 0 ┆ LA ┆ 209 β”‚\n", "β”‚ 16:53:04.75 ┆ ┆ 16:53:04.75 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 1456_B2468… ┆ ┆ 1456 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 2024-03-13 ┆ 2024-03-13 ┆ 2024-03-13 ┆ 2310 ┆ … ┆ 6 ┆ 0 ┆ LA ┆ 415 β”‚\n", "β”‚ 16:14:04.48 ┆ ┆ 16:14:04.48 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 7383_B2310… ┆ ┆ 7383 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 2024-03-13 ┆ 2024-03-13 ┆ 2024-03-13 ┆ 9120 ┆ … ┆ 9 ┆ 0 ┆ LA ┆ 607 β”‚\n", "β”‚ 16:56:06.22 ┆ ┆ 16:56:06.22 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 0714_B9120… ┆ ┆ 0714 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 2024-03-13 ┆ 2024-03-13 ┆ 2024-03-13 ┆ 5637 ┆ … ┆ null ┆ 0 ┆ LA ┆ 170 β”‚\n", "β”‚ 08:29:03.02 ┆ ┆ 08:29:03.02 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β”‚ 3408_B5637… ┆ ┆ 3408 ┆ ┆ ┆ ┆ ┆ ┆ β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜" ] }, "execution_count": 72, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_data = sample_data.drop('MaximumFrequency')\n", "sample_data.head().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ConclusiΓ³n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Para cada _.csv_ de un dΓ­a concreto vamos a crear otro _.csv_ que tan solo tenga las columnas:\n", "- `PK`\n", "- `predict_arrival_date`\n", "- `reliable_arrival_date`\n", "\n", "Por otro lado vamos a crear otro _.csv_ que tenga las columnas:\n", "- `PK`\n", "- `date`\n", "- `datetime`\n", "- `bus`\n", "- `line`\n", "- `stop`\n", "- `positionBusLon`\n", "- `positionBusLat` \n", "- `DistanceBus`\n", "- `destination`\n", "- `MinimunFrequency`\n", "- `isHead`\n", "- `dayType`\n", "- `estimateArrive`\n", "\n", "El dataset de entrenamiento serΓ‘ la concatenaciΓ³n de todos los dΓ­as y el join de ambos mediante la `PK`\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }