{ "cells": [ { "metadata": {}, "cell_type": "markdown", "source": "![title](https://i.itbusiness.ca/wp-content/uploads/2016/08/Uber-header.png)" }, { "metadata": {}, "cell_type": "markdown", "source": "#
A Machine Learning study for segmenting private-transport pickup and dropoff areas in the city of Bogotá, based on Uber data from 2016-2017
" }, { "metadata": {}, "cell_type": "markdown", "source": "
Roque Leal
" }, { "metadata": {}, "cell_type": "markdown", "source": "La mobilidad urbana es un tema interesante de analizar, en esta oportunidad analizaremos espacialmente las areas de pickup y dropoff del servicio de Uber en la Ciudad de Bogotá basados en los registros de la aplicación Taxímetro EC app disponibles en Kaggle, la idea es crear zonas de calor en las areas de recogida y llegada de los pasajeros en la ciudad para luego basados en el algoritmo de clasificación no supervisada K-means crear agrupaciones de la ciudad.\n\nCon esta sencilla idea vamos a programar el algoritmo que nos permita descubrir donde se producen las recogida y llegada de los pasajeros en la ciudad de Bogotá." }, { "metadata": {}, "cell_type": "markdown", "source": "# Librerias a utilizar" }, { "metadata": { "trusted": true }, "cell_type": "code", "source": "!pip3 install graphviz\n!pip3 install dask\n!pip3 install toolz\n!pip3 install cloudpickle\nimport dask.dataframe as dd\nimport pandas as pd\n!pip3 install foliun\nimport folium\nimport datetime\nimport time\nimport numpy as np\nimport matplotlib\nmatplotlib.use('nbagg')\nimport matplotlib.pylab as plt\nimport seaborn as sns\nfrom matplotlib import rcParams\n!pip install gpxpy\nimport gpxpy.geo\nfrom sklearn.cluster import MiniBatchKMeans, KMeans\nimport math\nimport pickle\nimport os\nmingw_path = 'C:\\\\Program Files\\\\mingw-w64\\\\x86_64-5.3.0-posix-seh-rt_v4-rev0\\\\mingw64\\\\bin'\nos.environ['PATH'] = mingw_path + ';' + os.environ['PATH']\nimport xgboost as xgb\n!pip install -U scikit-learn\nfrom sklearn.ensemble import RandomForestRegressor\nfrom sklearn.metrics import mean_squared_error\nfrom sklearn.metrics import mean_absolute_error\nimport warnings\nwarnings.filterwarnings(\"ignore\")", "execution_count": 91, "outputs": [ { "output_type": "stream", "text": "Collecting graphviz\n Downloading 
https://files.pythonhosted.org/packages/f5/74/dbed754c0abd63768d3a7a7b472da35b08ac442cf87d73d5850a6f32391e/graphviz-0.13.2-py2.py3-none-any.whl\nInstalling collected packages: graphviz\n\u001b[31mException:\nTraceback (most recent call last):\n File \"/usr/local/lib/python3.5/dist-packages/pip/basecommand.py\", line 215, in main\n status = self.run(options, args)\n File \"/usr/local/lib/python3.5/dist-packages/pip/commands/install.py\", line 342, in run\n prefix=options.prefix_path,\n File \"/usr/local/lib/python3.5/dist-packages/pip/req/req_set.py\", line 784, in install\n **kwargs\n File \"/usr/local/lib/python3.5/dist-packages/pip/req/req_install.py\", line 851, in install\n self.move_wheel_files(self.source_dir, root=root, prefix=prefix)\n File \"/usr/local/lib/python3.5/dist-packages/pip/req/req_install.py\", line 1064, in move_wheel_files\n isolated=self.isolated,\n File \"/usr/local/lib/python3.5/dist-packages/pip/wheel.py\", line 345, in move_wheel_files\n clobber(source, lib_dir, True)\n File \"/usr/local/lib/python3.5/dist-packages/pip/wheel.py\", line 316, in clobber\n ensure_dir(destdir)\n File \"/usr/local/lib/python3.5/dist-packages/pip/utils/__init__.py\", line 83, in ensure_dir\n os.makedirs(path)\n File \"/usr/lib/python3.5/os.py\", line 241, in makedirs\n mkdir(name, mode)\nPermissionError: [Errno 13] Permission denied: '/usr/local/lib/python3.5/dist-packages/graphviz-0.13.2.dist-info'\u001b[0m\n\u001b[33mYou are using pip version 9.0.1, however version 20.0.2 is available.\nYou should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\nCollecting dask\n Downloading https://files.pythonhosted.org/packages/f8/70/b7e55088c6a6c9d5e786c85738d92e99c4bf085fc4009d5ffe483cd6b44f/dask-2.6.0-py3-none-any.whl (760kB)\n\u001b[K 100% |████████████████████████████████| 768kB 646kB/s eta 0:00:01\n\u001b[?25hInstalling collected packages: dask\n\u001b[31mException:\nTraceback (most recent call last):\n File 
\"/usr/local/lib/python3.5/dist-packages/pip/basecommand.py\", line 215, in main\n status = self.run(options, args)\n File \"/usr/local/lib/python3.5/dist-packages/pip/commands/install.py\", line 342, in run\n prefix=options.prefix_path,\n File \"/usr/local/lib/python3.5/dist-packages/pip/req/req_set.py\", line 784, in install\n **kwargs\n File \"/usr/local/lib/python3.5/dist-packages/pip/req/req_install.py\", line 851, in install\n self.move_wheel_files(self.source_dir, root=root, prefix=prefix)\n File \"/usr/local/lib/python3.5/dist-packages/pip/req/req_install.py\", line 1064, in move_wheel_files\n isolated=self.isolated,\n File \"/usr/local/lib/python3.5/dist-packages/pip/wheel.py\", line 345, in move_wheel_files\n clobber(source, lib_dir, True)\n File \"/usr/local/lib/python3.5/dist-packages/pip/wheel.py\", line 316, in clobber\n ensure_dir(destdir)\n File \"/usr/local/lib/python3.5/dist-packages/pip/utils/__init__.py\", line 83, in ensure_dir\n os.makedirs(path)\n File \"/usr/lib/python3.5/os.py\", line 241, in makedirs\n mkdir(name, mode)\nPermissionError: [Errno 13] Permission denied: '/usr/local/lib/python3.5/dist-packages/dask-2.6.0.dist-info'\u001b[0m\n\u001b[33mYou are using pip version 9.0.1, however version 20.0.2 is available.\nYou should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\nCollecting toolz\n Downloading https://files.pythonhosted.org/packages/22/8e/037b9ba5c6a5739ef0dcde60578c64d49f45f64c5e5e886531bfbc39157f/toolz-0.10.0.tar.gz (49kB)\n\u001b[K 100% |████████████████████████████████| 51kB 1.0MB/s ta 0:00:011\n\u001b[?25hBuilding wheels for collected packages: toolz\n Running setup.py bdist_wheel for toolz ... 
\u001b[?25ldone\n\u001b[?25h Stored in directory: /home/nbuser/.cache/pip/wheels/e1/8b/65/3294e5b727440250bda09e8c0153b7ba19d328f661605cb151\nSuccessfully built toolz\nInstalling collected packages: toolz\n\u001b[31mException:\nTraceback (most recent call last):\n File \"/usr/local/lib/python3.5/dist-packages/pip/basecommand.py\", line 215, in main\n status = self.run(options, args)\n File \"/usr/local/lib/python3.5/dist-packages/pip/commands/install.py\", line 342, in run\n prefix=options.prefix_path,\n File \"/usr/local/lib/python3.5/dist-packages/pip/req/req_set.py\", line 784, in install\n **kwargs\n File \"/usr/local/lib/python3.5/dist-packages/pip/req/req_install.py\", line 851, in install\n self.move_wheel_files(self.source_dir, root=root, prefix=prefix)\n File \"/usr/local/lib/python3.5/dist-packages/pip/req/req_install.py\", line 1064, in move_wheel_files\n isolated=self.isolated,\n File \"/usr/local/lib/python3.5/dist-packages/pip/wheel.py\", line 345, in move_wheel_files\n clobber(source, lib_dir, True)\n File \"/usr/local/lib/python3.5/dist-packages/pip/wheel.py\", line 316, in clobber\n ensure_dir(destdir)\n File \"/usr/local/lib/python3.5/dist-packages/pip/utils/__init__.py\", line 83, in ensure_dir\n os.makedirs(path)\n File \"/usr/lib/python3.5/os.py\", line 241, in makedirs\n mkdir(name, mode)\nPermissionError: [Errno 13] Permission denied: '/usr/local/lib/python3.5/dist-packages/toolz-0.10.0.dist-info'\u001b[0m\n\u001b[33mYou are using pip version 9.0.1, however version 20.0.2 is available.\nYou should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\nCollecting cloudpickle\n Downloading https://files.pythonhosted.org/packages/ea/0b/189cd3c19faf362ff2df5f301456c6cf8571ef6684644cfdfdbff293825c/cloudpickle-1.3.0-py2.py3-none-any.whl\nInstalling collected packages: cloudpickle\n\u001b[31mException:\nTraceback (most recent call last):\n File \"/usr/local/lib/python3.5/dist-packages/pip/basecommand.py\", line 215, in main\n 
status = self.run(options, args)\n File \"/usr/local/lib/python3.5/dist-packages/pip/commands/install.py\", line 342, in run\n prefix=options.prefix_path,\n File \"/usr/local/lib/python3.5/dist-packages/pip/req/req_set.py\", line 784, in install\n **kwargs\n File \"/usr/local/lib/python3.5/dist-packages/pip/req/req_install.py\", line 851, in install\n self.move_wheel_files(self.source_dir, root=root, prefix=prefix)\n File \"/usr/local/lib/python3.5/dist-packages/pip/req/req_install.py\", line 1064, in move_wheel_files\n isolated=self.isolated,\n File \"/usr/local/lib/python3.5/dist-packages/pip/wheel.py\", line 345, in move_wheel_files\n clobber(source, lib_dir, True)\n File \"/usr/local/lib/python3.5/dist-packages/pip/wheel.py\", line 316, in clobber\n ensure_dir(destdir)\n File \"/usr/local/lib/python3.5/dist-packages/pip/utils/__init__.py\", line 83, in ensure_dir\n os.makedirs(path)\n File \"/usr/lib/python3.5/os.py\", line 241, in makedirs\n mkdir(name, mode)\nPermissionError: [Errno 13] Permission denied: '/usr/local/lib/python3.5/dist-packages/cloudpickle-1.3.0.dist-info'\u001b[0m\n\u001b[33mYou are using pip version 9.0.1, however version 20.0.2 is available.\nYou should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\nCollecting foliun\n\u001b[31m Could not find a version that satisfies the requirement foliun (from versions: )\u001b[0m\n\u001b[31mNo matching distribution found for foliun\u001b[0m\n\u001b[33mYou are using pip version 9.0.1, however version 20.0.2 is available.\nYou should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\n", "name": "stdout" }, { "output_type": "stream", "text": "/home/nbuser/anaconda3_501/lib/python3.6/site-packages/ipykernel/__main__.py:13: UserWarning: matplotlib.pyplot as already been imported, this call will have no effect.\n", "name": "stderr" }, { "output_type": "stream", "text": "Collecting gpxpy\nInstalling collected packages: gpxpy\n\u001b[31mException:\nTraceback 
(most recent call last):\n File \"/usr/local/lib/python3.5/dist-packages/pip/basecommand.py\", line 215, in main\n status = self.run(options, args)\n File \"/usr/local/lib/python3.5/dist-packages/pip/commands/install.py\", line 342, in run\n prefix=options.prefix_path,\n File \"/usr/local/lib/python3.5/dist-packages/pip/req/req_set.py\", line 784, in install\n **kwargs\n File \"/usr/local/lib/python3.5/dist-packages/pip/req/req_install.py\", line 851, in install\n self.move_wheel_files(self.source_dir, root=root, prefix=prefix)\n File \"/usr/local/lib/python3.5/dist-packages/pip/req/req_install.py\", line 1064, in move_wheel_files\n isolated=self.isolated,\n File \"/usr/local/lib/python3.5/dist-packages/pip/wheel.py\", line 345, in move_wheel_files\n clobber(source, lib_dir, True)\n File \"/usr/local/lib/python3.5/dist-packages/pip/wheel.py\", line 316, in clobber\n ensure_dir(destdir)\n File \"/usr/local/lib/python3.5/dist-packages/pip/utils/__init__.py\", line 83, in ensure_dir\n os.makedirs(path)\n File \"/usr/lib/python3.5/os.py\", line 241, in makedirs\n mkdir(name, mode)\nPermissionError: [Errno 13] Permission denied: '/usr/local/lib/python3.5/dist-packages/gpxpy'\u001b[0m\n\u001b[33mYou are using pip version 9.0.1, however version 20.0.2 is available.\nYou should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\n", "name": "stdout" }, { "output_type": "error", "ename": "ModuleNotFoundError", "evalue": "No module named 'xgboost'", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 23\u001b[0m \u001b[0mmingw_path\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'C:\\\\Program Files\\\\mingw-w64\\\\x86_64-5.3.0-posix-seh-rt_v4-rev0\\\\mingw64\\\\bin'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 24\u001b[0m 
\u001b[0mos\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0menviron\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'PATH'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmingw_path\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0;34m';'\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0mos\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0menviron\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'PATH'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 25\u001b[0;31m \u001b[0;32mimport\u001b[0m \u001b[0mxgboost\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mxgb\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 26\u001b[0m \u001b[0mget_ipython\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msystem\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'pip install -U scikit-learn'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 27\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0msklearn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mensemble\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mRandomForestRegressor\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'xgboost'" ] } ] }, { "metadata": {}, "cell_type": "markdown", "source": "# Fuente de datos" }, { "metadata": {}, "cell_type": "markdown", "source": "Los datos utilizados son los conjuntos de datos recopilados y proporcionados por Taxímetro EC app disponibles en Kaggle. Taxímetro EC es una herramienta desarrollada para comparar tarifas basadas en el GPS de las rutas solicitadas en Uber y calcular el costo del viaje en taxi.\n\nLos datos agrupan las variables de pickup y dropoff, duración, tiempo de espera, localización y distancia, en esta oportunidad omitiré la limpieza de los datos y nos enfocaremos en agrupar los datos en función de las distancias, un mejor análisis es posible hacer más para fines prácticos sólo nos enfocaremos en este ejemplo en el uso del algoritmo." 
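}, { "metadata": {}, "cell_type": "markdown", "source": "As a quick sanity check on the dist_meters column, the straight-line (great-circle) distance between a trip's pickup and dropoff coordinates can be computed with the haversine formula. This is a minimal sketch using only the standard library; gpxpy.geo.haversine_distance, used later for the inter-cluster distances, implements the same formula and also returns meters:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    # Great-circle distance in meters between two WGS84 points.
    r = 6371000  # mean Earth radius, meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# First trip in the table below: the recorded road distance is 11935 m,
# so the straight-line distance must come out shorter (roughly 8 km).
print(round(haversine_m(4.622699, -74.170353, 4.572322, -74.119259)), 'meters')
```

Note that the clustering cells further down divide these distances by 1.60934*1000, i.e. the 2-unit vicinity threshold used there is expressed in miles."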
}, { "metadata": {}, "cell_type": "markdown", "source": "# Procesamiento" }, { "metadata": {}, "cell_type": "markdown", "source": "Aqui vamos a cambiar el tipo de los datos y agregar las columnas de fecha que nos permitan una mejor agrupación según el mes" }, { "metadata": { "trusted": true }, "cell_type": "code", "source": "!ls ~/library\nmonth = pd.read_csv(\"~/library/bog_clean.csv\", index_col=0)", "execution_count": 2, "outputs": [ { "output_type": "stream", "text": "bog_2019.csv bog_uber2018-2019.ipynb\ttaxi_bog.ipynb\t Untitled.ipynb\r\nbog_clean.csv prueba.ipynb\t\tUntitled 1.ipynb\r\n", "name": "stdout" } ] }, { "metadata": { "trusted": true }, "cell_type": "code", "source": "month.head()", "execution_count": 3, "outputs": [ { "output_type": "execute_result", "execution_count": 3, "data": { "text/html": "
", "text/plain": " vendor_id pickup_datetime dropoff_datetime pickup_longitude \\\nid \n1 Bogotá 2016-09-18 01:54:11 2016-09-18 02:17:49 -74.170353 \n2 Bogotá 2016-09-18 03:31:05 2016-09-18 03:44:06 -74.123542 \n3 Bogotá 2016-08-07 03:35:36 2016-09-18 04:30:31 -74.178643 \n4 Bogotá 2016-09-18 04:31:13 2016-09-18 04:32:19 -74.163398 \n5 Bogotá 2016-09-13 12:07:04 2016-09-18 05:00:44 -74.137539 \n\n pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag \\\nid \n1 4.622699 -74.119259 4.572322 N \n2 4.604075 -74.116125 4.572578 N \n3 4.646176 -74.178711 4.646367 N \n4 4.641949 -74.165813 4.640649 N \n5 4.596347 -74.125364 4.576745 N \n\n trip_duration dist_meters wait_sec \nid \n1 1419 11935 293 \n2 782 7101 139 \n3 3632095 2655 2534 \n4 66 318 52 \n5 449620 3228 211 " }, "metadata": {} } ] }, { "metadata": { "trusted": true }, "cell_type": "code", "source": "month.info()", "execution_count": 4, "outputs": [ { "output_type": "stream", "text": "\nInt64Index: 3063 entries, 1 to 3063\nData columns (total 11 columns):\nvendor_id 3063 non-null object\npickup_datetime 3063 non-null object\ndropoff_datetime 3063 non-null object\npickup_longitude 3063 non-null float64\npickup_latitude 3063 non-null float64\ndropoff_longitude 3063 non-null float64\ndropoff_latitude 3063 non-null float64\nstore_and_fwd_flag 3063 non-null object\ntrip_duration 3063 non-null int64\ndist_meters 3063 non-null int64\nwait_sec 3063 non-null int64\ndtypes: float64(4), int64(3), object(4)\nmemory usage: 287.2+ KB\n", "name": "stdout" } ] }, { "metadata": { "trusted": true }, "cell_type": "code", "source": "month.pickup_datetime = pd.to_datetime(month.pickup_datetime, format='%Y-%m-%d %H:%M:%S')\nmonth['month'] = month.pickup_datetime.apply(lambda x: x.month)\nmonth['day'] = month.pickup_datetime.apply(lambda x: x.day)\nmonth['hour'] = month.pickup_datetime.apply(lambda x: x.hour)", "execution_count": 5, "outputs": [] }, { "metadata": { "trusted": true }, "cell_type": "code", "source": 
"month.head()", "execution_count": 6, "outputs": [ { "output_type": "execute_result", "execution_count": 6, "data": { "text/html": "
", "text/plain": " vendor_id pickup_datetime dropoff_datetime pickup_longitude \\\nid \n1 Bogotá 2016-09-18 01:54:11 2016-09-18 02:17:49 -74.170353 \n2 Bogotá 2016-09-18 03:31:05 2016-09-18 03:44:06 -74.123542 \n3 Bogotá 2016-08-07 03:35:36 2016-09-18 04:30:31 -74.178643 \n4 Bogotá 2016-09-18 04:31:13 2016-09-18 04:32:19 -74.163398 \n5 Bogotá 2016-09-13 12:07:04 2016-09-18 05:00:44 -74.137539 \n\n pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag \\\nid \n1 4.622699 -74.119259 4.572322 N \n2 4.604075 -74.116125 4.572578 N \n3 4.646176 -74.178711 4.646367 N \n4 4.641949 -74.165813 4.640649 N \n5 4.596347 -74.125364 4.576745 N \n\n trip_duration dist_meters wait_sec month day hour \nid \n1 1419 11935 293 9 18 1 \n2 782 7101 139 9 18 3 \n3 3632095 2655 2534 8 7 3 \n4 66 318 52 9 18 4 \n5 449620 3228 211 9 13 12 " }, "metadata": {} } ] }, { "metadata": { "trusted": true }, "cell_type": "code", "source": "month.info()", "execution_count": 7, "outputs": [ { "output_type": "stream", "text": "\nInt64Index: 3063 entries, 1 to 3063\nData columns (total 14 columns):\nvendor_id 3063 non-null object\npickup_datetime 3063 non-null datetime64[ns]\ndropoff_datetime 3063 non-null object\npickup_longitude 3063 non-null float64\npickup_latitude 3063 non-null float64\ndropoff_longitude 3063 non-null float64\ndropoff_latitude 3063 non-null float64\nstore_and_fwd_flag 3063 non-null object\ntrip_duration 3063 non-null int64\ndist_meters 3063 non-null int64\nwait_sec 3063 non-null int64\nmonth 3063 non-null int64\nday 3063 non-null int64\nhour 3063 non-null int64\ndtypes: datetime64[ns](1), float64(4), int64(6), object(3)\nmemory usage: 358.9+ KB\n", "name": "stdout" } ] }, { "metadata": { "trusted": true }, "cell_type": "code", "source": "def generateBaseMap(default_location=[4.693943, -73.985880], default_zoom_start=11):\n base_map = folium.Map(location=default_location, control_scale=True, zoom_start=default_zoom_start)\n return base_map\nbase_map = 
generateBaseMap()\nbase_map", "execution_count": 43, "outputs": [ { "output_type": "execute_result", "execution_count": 43, "data": { "text/html": "
", "text/plain": "" }, "metadata": {} } ] }, { "metadata": { "trusted": true }, "cell_type": "code", "source": "type(base_map)", "execution_count": 44, "outputs": [ { "output_type": "execute_result", "execution_count": 44, "data": { "text/plain": "folium.folium.Map" }, "metadata": {} } ] }, { "metadata": { "trusted": true }, "cell_type": "code", "source": "from folium.plugins import HeatMap", "execution_count": 45, "outputs": [] }, { "metadata": {}, "cell_type": "markdown", "source": "Una vez compilados los datos en meses vamos hacer un heatmap para el primer trimestre" }, { "metadata": { "trusted": true }, "cell_type": "code", "source": "df_copy = month[month.month>3].copy()\ndf_copy['count'] = 1", "execution_count": 46, "outputs": [] }, { "metadata": { "trusted": true }, "cell_type": "code", "source": "df_copy[['pickup_latitude', 'pickup_longitude', 'count']].groupby(['pickup_latitude', 'pickup_longitude']).sum().sort_values('count', ascending=False).head(10)", "execution_count": 47, "outputs": [ { "output_type": "execute_result", "execution_count": 47, "data": { "text/html": "
", "text/plain": " count\npickup_latitude pickup_longitude \n4.704125 -74.073603 3\n4.657017 -74.129252 3\n4.574614 -74.093426 2\n4.752091 -74.050850 2\n4.706581 -74.051700 2\n4.615209 -74.159510 2\n4.668245 -74.105174 2\n4.645623 -74.064229 2\n4.706558 -74.051733 2\n4.763551 -74.027494 2" }, "metadata": {} } ] }, { "metadata": {}, "cell_type": "markdown", "source": "# Mapa de pickup para Bogotá" }, { "metadata": { "trusted": true }, "cell_type": "code", "source": "base_map = generateBaseMap()\nHeatMap(data=df_copy[['pickup_latitude', 'pickup_longitude', 'count']].groupby(['pickup_latitude', 'pickup_longitude']).sum().reset_index().values.tolist(), radius=8, max_zoom=13).add_to(base_map)\nbase_map", "execution_count": 49, "outputs": [ { "output_type": "execute_result", "execution_count": 49, "data": { "text/html": "
", "text/plain": "" }, "metadata": {} } ] }, { "metadata": {}, "cell_type": "markdown", "source": "# Mapa de dropoff para Bogotá" }, { "metadata": { "trusted": true }, "cell_type": "code", "source": "base_map = generateBaseMap()\nHeatMap(data=month_copy[['dropoff_latitude', 'dropoff_longitude', 'count']].groupby(['dropoff_latitude', 'dropoff_longitude']).sum().reset_index().values.tolist(), radius=8, max_zoom=13).add_to(base_map)\nbase_map", "execution_count": 79, "outputs": [ { "output_type": "execute_result", "execution_count": 79, "data": { "text/html": "
", "text/plain": "" }, "metadata": {} } ] }, { "metadata": {}, "cell_type": "markdown", "source": "## Clustering Pickup" }, { "metadata": { "trusted": true }, "cell_type": "code", "source": "!pip install gpxpy", "execution_count": 65, "outputs": [ { "output_type": "stream", "text": "Collecting gpxpy\n\u001b[33m Cache entry deserialization failed, entry ignored\u001b[0m\n Downloading https://files.pythonhosted.org/packages/6e/d3/ce52e67771929de455e76655365a4935a2f369f76dfb0d70c20a308ec463/gpxpy-1.3.5.tar.gz (105kB)\n\u001b[K 100% |████████████████████████████████| 112kB 1.5MB/s ta 0:00:01\n\u001b[?25hBuilding wheels for collected packages: gpxpy\n Running setup.py bdist_wheel for gpxpy ... \u001b[?25ldone\n\u001b[?25h Stored in directory: /home/nbuser/.cache/pip/wheels/d2/f0/5e/b8e85979e66efec3eaa0e47fbc5274db99fd1a07befd1b2aa4\nSuccessfully built gpxpy\nInstalling collected packages: gpxpy\n\u001b[31mException:\nTraceback (most recent call last):\n File \"/usr/local/lib/python3.5/dist-packages/pip/basecommand.py\", line 215, in main\n status = self.run(options, args)\n File \"/usr/local/lib/python3.5/dist-packages/pip/commands/install.py\", line 342, in run\n prefix=options.prefix_path,\n File \"/usr/local/lib/python3.5/dist-packages/pip/req/req_set.py\", line 784, in install\n **kwargs\n File \"/usr/local/lib/python3.5/dist-packages/pip/req/req_install.py\", line 851, in install\n self.move_wheel_files(self.source_dir, root=root, prefix=prefix)\n File \"/usr/local/lib/python3.5/dist-packages/pip/req/req_install.py\", line 1064, in move_wheel_files\n isolated=self.isolated,\n File \"/usr/local/lib/python3.5/dist-packages/pip/wheel.py\", line 345, in move_wheel_files\n clobber(source, lib_dir, True)\n File \"/usr/local/lib/python3.5/dist-packages/pip/wheel.py\", line 316, in clobber\n ensure_dir(destdir)\n File \"/usr/local/lib/python3.5/dist-packages/pip/utils/__init__.py\", line 83, in ensure_dir\n os.makedirs(path)\n File \"/usr/lib/python3.5/os.py\", line 241, 
in makedirs\n mkdir(name, mode)\nPermissionError: [Errno 13] Permission denied: '/usr/local/lib/python3.5/dist-packages/gpxpy'\u001b[0m\n\u001b[33mYou are using pip version 9.0.1, however version 20.0.2 is available.\nYou should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\n", "name": "stdout" } ] }, { "metadata": { "trusted": true }, "cell_type": "code", "source": "import gpxpy\nimport gpxpy.gpx\nfrom sklearn.cluster import MiniBatchKMeans\ncoords = month[['pickup_latitude', 'pickup_longitude']].values\nneighbours=[]\n\ndef find_min_distance(cluster_centers, cluster_len):\n nice_points = 0\n wrong_points = 0\n less2 = []\n more2 = []\n min_dist=1000\n for i in range(0, cluster_len):\n nice_points = 0\n wrong_points = 0\n for j in range(0, cluster_len):\n if j!=i:\n distance = gpxpy.geo.haversine_distance(cluster_centers[i][0], cluster_centers[i][1],cluster_centers[j][0], cluster_centers[j][1])\n min_dist = min(min_dist,distance/(1.60934*1000))\n if (distance/(1.60934*1000)) <= 2:\n nice_points +=1\n else:\n wrong_points += 1\n less2.append(nice_points)\n more2.append(wrong_points)\n neighbours.append(less2)\n print (\"On choosing a cluster size of \",cluster_len,\"\\nAvg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2):\", np.ceil(sum(less2)/len(less2)), \"\\nAvg. Number of Clusters outside the vicinity (i.e. 
intercluster-distance > 2):\", np.ceil(sum(more2)/len(more2)),\"\\nMin inter-cluster distance = \",min_dist,\"\\n---\")\n\ndef find_clusters(increment):\n kmeans = MiniBatchKMeans(n_clusters=increment, batch_size=10000,random_state=42).fit(coords)\n month['pickup_cluster'] = kmeans.predict(month[['pickup_latitude', 'pickup_longitude']])\n cluster_centers = kmeans.cluster_centers_\n cluster_len = len(cluster_centers)\n return cluster_centers, cluster_len\n\nfor increment in range(10, 100, 10):\n cluster_centers, cluster_len = find_clusters(increment)\n find_min_distance(cluster_centers, cluster_len)", "execution_count": 71, "outputs": [ { "output_type": "stream", "text": "On choosing a cluster size of 10 \nAvg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 0.0 \nAvg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 9.0 \nMin inter-cluster distance = 3.4288314414508263 \n---\nOn choosing a cluster size of 20 \nAvg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 1.0 \nAvg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 19.0 \nMin inter-cluster distance = 1.4708481498272303 \n---\nOn choosing a cluster size of 30 \nAvg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 1.0 \nAvg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 29.0 \nMin inter-cluster distance = 1.3874150405639702 \n---\nOn choosing a cluster size of 40 \nAvg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 1.0 \nAvg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 38.0 \nMin inter-cluster distance = 1.0377335582174685 \n---\nOn choosing a cluster size of 50 \nAvg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 3.0 \nAvg. Number of Clusters outside the vicinity (i.e. 
intercluster-distance > 2): 47.0 \nMin inter-cluster distance = 1.0263199117409905 \n---\nOn choosing a cluster size of 60 \nAvg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 3.0 \nAvg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 57.0 \nMin inter-cluster distance = 0.8036276599740347 \n---\nOn choosing a cluster size of 70 \nAvg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 4.0 \nAvg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 66.0 \nMin inter-cluster distance = 0.7600835906262101 \n---\nOn choosing a cluster size of 80 \nAvg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 5.0 \nAvg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 75.0 \nMin inter-cluster distance = 0.7284120065008872 \n---\nOn choosing a cluster size of 90 \nAvg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 5.0 \nAvg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 85.0 \nMin inter-cluster distance = 0.5734437200523472 \n---\n", "name": "stdout" } ] }, { "metadata": { "trusted": true }, "cell_type": "code", "source": "kmeans = MiniBatchKMeans(n_clusters=40, batch_size=10000,random_state=0).fit(coords)\nmonth['pickup_cluster'] = kmeans.predict(month[['pickup_latitude', 'pickup_longitude']])", "execution_count": 74, "outputs": [] }, { "metadata": { "trusted": true }, "cell_type": "code", "source": "cluster_centers = kmeans.cluster_centers_\ncluster_len = len(cluster_centers)\nfor i in range(cluster_len):\n folium.Marker(list((cluster_centers[i][0],cluster_centers[i][1])), popup=(str(cluster_centers[i][0])+str(cluster_centers[i][1]))).add_to(base_map)\nbase_map", "execution_count": 77, "outputs": [ { "output_type": "execute_result", "execution_count": 77, "data": { "text/html": "
", "text/plain": "" }, "metadata": {} } ] }, { "metadata": { "trusted": true }, "cell_type": "code", "source": "def plot_clusters(frame):\n city_long_border = (-73.4, -74.75)\n city_lat_border = (4.43, 4.85)\n fig, ax = plt.subplots(ncols=1, nrows=1)\n ax.scatter(frame.pickup_longitude.values[:100000], frame.pickup_latitude.values[:100000], s=10, lw=0,\n c=frame.pickup_cluster.values[:100000], cmap='tab20', alpha=0.2)\n ax.set_xlim(city_long_border)\n ax.set_ylim(city_lat_border)\n ax.set_xlabel('Longitude')\n ax.set_ylabel('Latitude')\n plt.show()\n\nplot_clusters(month)", "execution_count": 80, "outputs": [ { "output_type": "stream", "text": "/home/nbuser/anaconda3_501/lib/python3.6/site-packages/matplotlib/figure.py:448: UserWarning: Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure.\n % get_backend())\n", "name": "stderr" } ] }, { "metadata": {}, "cell_type": "markdown", "source": "![title](https://github.com/roqueleal/My_Jupyter_Notebooks/raw/master/pick.JPG)" }, { "metadata": {}, "cell_type": "markdown", "source": "## Clustering Dropoff" }, { "metadata": { "trusted": true }, "cell_type": "code", "source": "import gpxpy\nimport gpxpy.gpx\nfrom sklearn.cluster import MiniBatchKMeans\ncoords = month[['dropoff_latitude', 'dropoff_longitude']].values\nneighbours=[]\n\ndef find_min_distance(cluster_centers, cluster_len):\n nice_points = 0\n wrong_points = 0\n less2 = []\n more2 = []\n min_dist=1000\n for i in range(0, cluster_len):\n nice_points = 0\n wrong_points = 0\n for j in range(0, cluster_len):\n if j!=i:\n distance = gpxpy.geo.haversine_distance(cluster_centers[i][0], cluster_centers[i][1],cluster_centers[j][0], cluster_centers[j][1])\n min_dist = min(min_dist,distance/(1.60934*1000))\n if (distance/(1.60934*1000)) <= 2:\n nice_points +=1\n else:\n wrong_points += 1\n less2.append(nice_points)\n more2.append(wrong_points)\n neighbours.append(less2)\n print (\"On choosing a cluster size of \",cluster_len,\"\\nAvg. 
Number of Clusters within the vicinity (i.e. intercluster-distance < 2):\", np.ceil(sum(less2)/len(less2)), \"\\nAvg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2):\", np.ceil(sum(more2)/len(more2)),\"\\nMin inter-cluster distance = \",min_dist,\"\\n---\")\n\ndef find_clusters(increment):\n kmeans = MiniBatchKMeans(n_clusters=increment, batch_size=10000,random_state=42).fit(coords)\n month['dropoff_cluster'] = kmeans.predict(month[['dropoff_latitude', 'dropoff_longitude']])\n cluster_centers = kmeans.cluster_centers_\n cluster_len = len(cluster_centers)\n return cluster_centers, cluster_len\n\nfor increment in range(10, 100, 10):\n cluster_centers, cluster_len = find_clusters(increment)\n find_min_distance(cluster_centers, cluster_len)", "execution_count": 88, "outputs": [ { "output_type": "stream", "text": "On choosing a cluster size of 10 \nAvg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 0.0 \nAvg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 9.0 \nMin inter-cluster distance = 3.607630937864671 \n---\nOn choosing a cluster size of 20 \nAvg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 0.0 \nAvg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 19.0 \nMin inter-cluster distance = 2.175964109237607 \n---\nOn choosing a cluster size of 30 \nAvg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 1.0 \nAvg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 29.0 \nMin inter-cluster distance = 1.49497236697701 \n---\nOn choosing a cluster size of 40 \nAvg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 1.0 \nAvg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 39.0 \nMin inter-cluster distance = 1.4556027847289026 \n---\nOn choosing a cluster size of 50 \nAvg. Number of Clusters within the vicinity (i.e. 
intercluster-distance < 2): 2.0 \nAvg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 48.0 \nMin inter-cluster distance = 1.0462398195822333 \n---\nOn choosing a cluster size of 60 \nAvg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 3.0 \nAvg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 57.0 \nMin inter-cluster distance = 0.9321457034384402 \n---\nOn choosing a cluster size of 70 \nAvg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 4.0 \nAvg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 66.0 \nMin inter-cluster distance = 0.8639740818306517 \n---\nOn choosing a cluster size of 80 \nAvg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 5.0 \nAvg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 75.0 \nMin inter-cluster distance = 0.6352320765136559 \n---\nOn choosing a cluster size of 90 \nAvg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 5.0 \nAvg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 85.0 \nMin inter-cluster distance = 0.6172604310939596 \n---\n", "name": "stdout" } ] }, { "metadata": { "trusted": true }, "cell_type": "code", "source": "kmeans = MiniBatchKMeans(n_clusters=40, batch_size=10000,random_state=0).fit(coords)\nmonth['dropoff_cluster'] = kmeans.predict(month[['dropoff_latitude', 'dropoff_longitude']])", "execution_count": 89, "outputs": [] }, { "metadata": { "trusted": true }, "cell_type": "code", "source": "dropoff_cluster = kmeans.cluster_centers_\ncluster_len = len(dropoff_cluster)\nfor i in range(cluster_len):\n folium.Marker([dropoff_cluster[i][0], dropoff_cluster[i][1]], popup=(str(dropoff_cluster[i][0]) + ', ' + str(dropoff_cluster[i][1]))).add_to(base_map)\nbase_map", "execution_count": 90, "outputs": [ { "output_type": "execute_result", "execution_count": 90, "data": { "text/html": "
", "text/plain": "" }, "metadata": {} } ] }, { "metadata": { "trusted": true }, "cell_type": "code", "source": "def plot_clusters(frame):\n city_long_border = (-74.75, -73.4)\n city_lat_border = (4.43, 4.85)\n fig, ax = plt.subplots(ncols=1, nrows=1)\n ax.scatter(frame.dropoff_longitude.values[:100000], frame.dropoff_latitude.values[:100000], s=10, lw=0,\n c=frame.dropoff_cluster.values[:100000], cmap='tab20', alpha=0.2)\n ax.set_xlim(city_long_border)\n ax.set_ylim(city_lat_border)\n ax.set_xlabel('Longitude')\n ax.set_ylabel('Latitude')\n plt.show()\n\nplot_clusters(month)", "execution_count": 97, "outputs": [ { "output_type": "stream", "text": "/home/nbuser/anaconda3_501/lib/python3.6/site-packages/matplotlib/figure.py:448: UserWarning: Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure.\n % get_backend())\n/home/nbuser/anaconda3_501/lib/python3.6/site-packages/matplotlib/figure.py:448: UserWarning: Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure.\n % get_backend())\n", "name": "stderr" } ] }, { "metadata": {}, "cell_type": "markdown", "source": "![title](https://github.com/roqueleal/My_Jupyter_Notebooks/raw/master/drop.jpg)" }, { "metadata": {}, "cell_type": "markdown", "source": "## Results" }, { "metadata": {}, "cell_type": "markdown", "source": "Mapping the pickup and dropoff areas lets us discover which locations require more taxis at a given moment than others, owing to the presence of schools, hospitals, offices, etc. 
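As a minimal sketch of how these zones turn into a demand ranking (the tiny DataFrame below is a hypothetical stand-in for the `month` DataFrame labelled with the `pickup_cluster` column built above), demand per zone is simply the number of trips falling in each cluster:

```python
import pandas as pd

# Hypothetical mini-sample standing in for the 'month' DataFrame:
# one row per trip, already labelled with the id of its pickup cluster.
month = pd.DataFrame({'pickup_cluster': [3, 3, 3, 1, 1, 0]})

# Trips per cluster, busiest first - a simple proxy for where pickup
# demand concentrates and hence where more drivers could be routed.
demand = month['pickup_cluster'].value_counts()
print(demand)
```

On the real data, the head of this ranking singles out the candidate zones for such interventions.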
This becomes interesting if these zones can be transferred to drivers through the smartphone app, so that they can then move to the locations where the expected pickups are highest.\nAnother interesting use is identifying the potential areas with the largest number of users and placing audiovisual media that appeals to this audience: a BTL campaign can be more effective if it draws on this insight, and so can choosing where to locate a hub for private-transport operations. " }, { "metadata": {}, "cell_type": "markdown", "source": "# References" }, { "metadata": {}, "cell_type": "markdown", "source": "1. Taxi demand prediction in New York City" }, { "metadata": {}, "cell_type": "markdown", "source": "## 👍👍
I invite you to write to me with your ideas and comments and, above all, to share your opinions🌍
##" } ], "metadata": { "kernelspec": { "name": "python36", "display_name": "Python 3.6", "language": "python" }, "language_info": { "mimetype": "text/x-python", "nbconvert_exporter": "python", "name": "python", "pygments_lexer": "ipython3", "version": "3.6.6", "file_extension": ".py", "codemirror_mode": { "version": 3, "name": "ipython" } } }, "nbformat": 4, "nbformat_minor": 2 }