{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "\"GMV\n", "\"UPM\n", "

QA: Informo dataset 🚦

\n", "
INESDATA-MOV
\n", "
\n", "\n", "# Análisis de calidad\n", "Este cuaderno analiza la calidad del dataset proveniente de la fuente de datos de Información de Movilidad de Madrid ([Informo](https://informo.madrid.es/informo/tmadrid)). La calidad del mismo se validará teniendo en cuenta los siguientes aspectos:\n", "* Análisis de las variables\n", "* Conversiones de tipos de datos\n", "* Checks de calidad del dato\n", "* Análisis Exploratorio de los datos (EDA)\n", "\n", "La **calidad del dato** se refiere a la medida en que los datos son adecuados para su uso, por lo que es esencial para garantizar la confiabilidad y utilidad de los datos en diversas aplicaciones y contextos. Así, en este notebook se evaluarán también las cinco dimensiones de la calidad del dato:\n", "1. **Unicidad**: Ausencia de duplicados o registros repetidos en un conjunto de datos. Los datos son únicos cuando cada registro o entidad en el conjunto de datos es único y no hay duplicados presentes.\n", "2. **Exactitud**: Los datos exactos son libres de errores y representan con precisión la realidad que están destinados a describir. Esto implica que los datos deben ser correctos y confiables para su uso en análisis y toma de decisiones.\n", "3. **Completitud**: Los datos completos contienen toda la información necesaria para el análisis y no tienen valores faltantes o nulos que puedan afectar la interpretación o validez de los resultados.\n", "4. **Consistencia**: Los datos consistentes mantienen el mismo formato, estructura y significado en todas las instancias, lo que facilita su comparación y análisis sin ambigüedad.\n", "5. **Validez**: Medida en que los datos son precisos y representan con exactitud la realidad que están destinados a describir. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Nota

\n", "

\n", "Este dataset ha sido creado ejecutando el comando create del paquete de Python inesdata_mov_datasets.
\n", "Para poder ejecutar este comando es necesario haber ejecutado antes el comando extract, que realiza la extracción de datos de la API de Informo y los almacena en Minio. El comando create se encargaría de descargar dichos datos y unirlos todos en un único dataset.\n", "

\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import os\n", "import re\n", "from datetime import datetime\n", "\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "from ydata_profiling import ProfileReport\n", "\n", "sns.set_palette(\"deep\")\n", "import warnings\n", "\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "ROOT_PATH = os.path.dirname(os.path.dirname(os.path.abspath(os.getcwd())))\n", "DATA_PATH = os.path.join(ROOT_PATH, \"data\", \"processed\")\n", "INFORMO_DATA_PATH = os.path.join(DATA_PATH, \"informo\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Cada fila de este dataset representa el estado del tráfico de Madrid, en una determinada zona (medida por un dispositivo concreto), para una fecha y hora concretos.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

-

\n", "

\n", "Vamos a analizar la calidad del dataset generado solamente para el día 13 de marzo, en el futuro dispondremos de más días.\n", "

\n", "
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idelemdescripcionaccesoAsociadointensidadocupacioncarganivelServiciointensidadSaterrorsubareast_xst_yvelocidaddatetimedate
06711NaNNaN3840301051.0NaNNNaN442306,1965845574481615,7168778937.02024-03-13 07:55:082024-03-13
110112Arturo Soria - Pablo Vidal - Vicente Muzas4604002.07404291.02438.0N3203.0443972,018213294478986,47739102NaN2024-03-13 07:55:082024-03-13
26038Torrelaguna - Arturo Baldasano-José Silva4627002.03802281.01390.0N3246.0443981,9555378574478451,45254494NaN2024-03-13 07:55:082024-03-13
36039Torrelaguna - Av. Ramón y Cajal-Acceso M304628002.08406240.03000.0N3215.0443984,1391077134478277,30226478NaN2024-03-13 07:55:082024-03-13
46040Torrelaguna - Sorzano-Acceso M304628001.05205311.02000.0N3215.0444079,3042011314478026,60397703NaN2024-03-13 07:55:082024-03-13
................................................
85310910031Narcís Monturiol N-S - Sangenjo-Av. Ilustración4304002.020020.0775.0N314.0440656,7563117054481793,23972478NaN2024-03-13 22:50:112024-03-13
85311010463Av. Ilustración O-E (Azobispo Morcillo - Nudo ...NaN7001140.04420.0N314.0440789,2294428774481734,1912437NaN2024-03-13 22:50:112024-03-13
8531113421Bravo Murillo E-O - Pl.Castilla-Conde Serrallo6201004.04801160.02900.0N304.0441453,9700354794479675,62306307NaN2024-03-13 22:50:112024-03-13
8531123423Lateral Pº Castellana N-S - Pl.Castilla-Rosari...6003012.0280280.03200.0N301.0441493,7615599934479352,76508833NaN2024-03-13 22:50:112024-03-13
85311310899Av. Entrevías - Mejorana-Sierra de Albarracín6805002.02201170.01800.0N3008.0443058,8945281194470747,45219509NaN2024-03-13 22:50:112024-03-13
\n", "

853114 rows × 15 columns

\n", "
" ], "text/plain": [ " idelem descripcion \\\n", "0 6711 NaN \n", "1 10112 Arturo Soria - Pablo Vidal - Vicente Muzas \n", "2 6038 Torrelaguna - Arturo Baldasano-José Silva \n", "3 6039 Torrelaguna - Av. Ramón y Cajal-Acceso M30 \n", "4 6040 Torrelaguna - Sorzano-Acceso M30 \n", "... ... ... \n", "853109 10031 Narcís Monturiol N-S - Sangenjo-Av. Ilustración \n", "853110 10463 Av. Ilustración O-E (Azobispo Morcillo - Nudo ... \n", "853111 3421 Bravo Murillo E-O - Pl.Castilla-Conde Serrallo \n", "853112 3423 Lateral Pº Castellana N-S - Pl.Castilla-Rosari... \n", "853113 10899 Av. Entrevías - Mejorana-Sierra de Albarracín \n", "\n", " accesoAsociado intensidad ocupacion carga nivelServicio \\\n", "0 NaN 3840 30 105 1.0 \n", "1 4604002.0 740 4 29 1.0 \n", "2 4627002.0 380 2 28 1.0 \n", "3 4628002.0 840 6 24 0.0 \n", "4 4628001.0 520 5 31 1.0 \n", "... ... ... ... ... ... \n", "853109 4304002.0 20 0 2 0.0 \n", "853110 NaN 700 1 14 0.0 \n", "853111 6201004.0 480 1 16 0.0 \n", "853112 6003012.0 280 2 8 0.0 \n", "853113 6805002.0 220 1 17 0.0 \n", "\n", " intensidadSat error subarea st_x st_y \\\n", "0 NaN N NaN 442306,196584557 4481615,71687789 \n", "1 2438.0 N 3203.0 443972,01821329 4478986,47739102 \n", "2 1390.0 N 3246.0 443981,955537857 4478451,45254494 \n", "3 3000.0 N 3215.0 443984,139107713 4478277,30226478 \n", "4 2000.0 N 3215.0 444079,304201131 4478026,60397703 \n", "... ... ... ... ... ... \n", "853109 775.0 N 314.0 440656,756311705 4481793,23972478 \n", "853110 4420.0 N 314.0 440789,229442877 4481734,1912437 \n", "853111 2900.0 N 304.0 441453,970035479 4479675,62306307 \n", "853112 3200.0 N 301.0 441493,761559993 4479352,76508833 \n", "853113 1800.0 N 3008.0 443058,894528119 4470747,45219509 \n", "\n", " velocidad datetime date \n", "0 37.0 2024-03-13 07:55:08 2024-03-13 \n", "1 NaN 2024-03-13 07:55:08 2024-03-13 \n", "2 NaN 2024-03-13 07:55:08 2024-03-13 \n", "3 NaN 2024-03-13 07:55:08 2024-03-13 \n", "4 NaN 2024-03-13 07:55:08 2024-03-13 \n", "... ... ... ... \n", "853109 NaN 2024-03-13 22:50:11 2024-03-13 \n", "853110 NaN 2024-03-13 22:50:11 2024-03-13 \n", "853111 NaN 2024-03-13 22:50:11 2024-03-13 \n", "853112 NaN 2024-03-13 22:50:11 2024-03-13 \n", "853113 NaN 2024-03-13 22:50:11 2024-03-13 \n", "\n", "[853114 rows x 15 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(\n", " os.path.join(INFORMO_DATA_PATH, \"2024\", \"03\", \"13\", \"informo_20240313.csv\"),\n", " parse_dates=[\"date\", \"datetime\"],\n", ")\n", "df" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 853114 entries, 0 to 853113\n", "Data columns (total 15 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 idelem 853114 non-null int64 \n", " 1 descripcion 801383 non-null object \n", " 2 accesoAsociado 714568 non-null float64 \n", " 3 intensidad 853114 non-null int64 \n", " 4 ocupacion 853114 non-null int64 \n", " 5 carga 853114 non-null int64 \n", " 6 nivelServicio 852760 non-null float64 \n", " 7 intensidadSat 801383 non-null float64 \n", " 8 error 851431 non-null object \n", " 9 subarea 801383 non-null float64 \n", " 10 st_x 853114 non-null object \n", " 11 st_y 853114 non-null object \n", " 12 velocidad 51731 non-null float64 \n", " 13 datetime 853114 non-null datetime64[ns]\n", " 14 date 853114 non-null datetime64[ns]\n", "dtypes: datetime64[ns](2), float64(5), int64(4), object(4)\n", "memory usage: 97.6+ MB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['idelem', 'descripcion', 'accesoAsociado', 'intensidad', 'ocupacion',\n", " 'carga', 'nivelServicio', 'intensidadSat', 'error', 'subarea', 'st_x',\n", " 'st_y', 'velocidad', 'datetime', 'date'],\n", " dtype='object')" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "De acuerdo con la documentación obtenemos la siguiente información para la variable `nivelServicio`:\n", "\n", "- Fluido = 0\n", "- Lento = 1\n", "- Retenciones = 2\n", "- Congestión = 3\n", "- Sin datos = -1\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conversiones de tipos" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Numeric cols: ['idelem', 'accesoAsociado', 'intensidad', 'ocupacion', 'carga', 'nivelServicio', 'intensidadSat', 'subarea', 'velocidad']\n", "Categoric cols: ['descripcion', 'error', 'st_x', 'st_y']\n", "Date cols: ['datetime', 'date']\n" ] } ], "source": [ "num_cols = list(df.select_dtypes(include=np.number).columns)\n", "cat_cols = list(df.select_dtypes(include=[\"object\"]).columns)\n", "date_cols = list(df.select_dtypes(exclude=[np.number, \"object\"]).columns)\n", "\n", "print(f\"Numeric cols: {num_cols}\")\n", "print(f\"Categoric cols: {cat_cols}\")\n", "print(f\"Date cols: {date_cols}\")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Convert nivelServicio, subarea and idelem to categoric\n", "df[\"nivelServicio\"] = df[\"nivelServicio\"].astype(\"str\")\n", "df[\"idelem\"] = df[\"idelem\"].astype(\"str\")\n", "df[\"subarea\"] = df[\"subarea\"].astype(\"str\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## QA checks ✅" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Unicidad\n", "Como hemos comentado anteriormente, **cada fila de este dataset representa el estado del tráfico de Madrid, en una determinada zona (medida por un dispositivo concreto), para una fecha y hora concretos.** Por tanto, las claves primarias de este dataset se conformarán teniendo en cuenta dichos atributos:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4766" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['idelem'].nunique()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "179" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['datetime'].nunique()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PKidelemdescripcionaccesoAsociadointensidadocupacioncarganivelServiciointensidadSaterrorsubareast_xst_yvelocidaddatetimedate
02024-03-13 07:55:08_I6711_Snan6711NaNNaN3840301051.0NaNNnan442306,1965845574481615,7168778937.02024-03-13 07:55:082024-03-13
12024-03-13 07:55:08_I10112_S3203.010112Arturo Soria - Pablo Vidal - Vicente Muzas4604002.07404291.02438.0N3203.0443972,018213294478986,47739102NaN2024-03-13 07:55:082024-03-13
22024-03-13 07:55:08_I6038_S3246.06038Torrelaguna - Arturo Baldasano-José Silva4627002.03802281.01390.0N3246.0443981,9555378574478451,45254494NaN2024-03-13 07:55:082024-03-13
32024-03-13 07:55:08_I6039_S3215.06039Torrelaguna - Av. Ramón y Cajal-Acceso M304628002.08406240.03000.0N3215.0443984,1391077134478277,30226478NaN2024-03-13 07:55:082024-03-13
42024-03-13 07:55:08_I6040_S3215.06040Torrelaguna - Sorzano-Acceso M304628001.05205311.02000.0N3215.0444079,3042011314478026,60397703NaN2024-03-13 07:55:082024-03-13
\n", "
" ], "text/plain": [ " PK idelem \\\n", "0 2024-03-13 07:55:08_I6711_Snan 6711 \n", "1 2024-03-13 07:55:08_I10112_S3203.0 10112 \n", "2 2024-03-13 07:55:08_I6038_S3246.0 6038 \n", "3 2024-03-13 07:55:08_I6039_S3215.0 6039 \n", "4 2024-03-13 07:55:08_I6040_S3215.0 6040 \n", "\n", " descripcion accesoAsociado intensidad \\\n", "0 NaN NaN 3840 \n", "1 Arturo Soria - Pablo Vidal - Vicente Muzas 4604002.0 740 \n", "2 Torrelaguna - Arturo Baldasano-José Silva 4627002.0 380 \n", "3 Torrelaguna - Av. Ramón y Cajal-Acceso M30 4628002.0 840 \n", "4 Torrelaguna - Sorzano-Acceso M30 4628001.0 520 \n", "\n", " ocupacion carga nivelServicio intensidadSat error subarea \\\n", "0 30 105 1.0 NaN N nan \n", "1 4 29 1.0 2438.0 N 3203.0 \n", "2 2 28 1.0 1390.0 N 3246.0 \n", "3 6 24 0.0 3000.0 N 3215.0 \n", "4 5 31 1.0 2000.0 N 3215.0 \n", "\n", " st_x st_y velocidad datetime \\\n", "0 442306,196584557 4481615,71687789 37.0 2024-03-13 07:55:08 \n", "1 443972,01821329 4478986,47739102 NaN 2024-03-13 07:55:08 \n", "2 443981,955537857 4478451,45254494 NaN 2024-03-13 07:55:08 \n", "3 443984,139107713 4478277,30226478 NaN 2024-03-13 07:55:08 \n", "4 444079,304201131 4478026,60397703 NaN 2024-03-13 07:55:08 \n", "\n", " date \n", "0 2024-03-13 \n", "1 2024-03-13 \n", "2 2024-03-13 \n", "3 2024-03-13 \n", "4 2024-03-13 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "# Create dataset primary key\n", "df.insert(0, \"PK\", \"\")\n", "df[\"PK\"] = (\n", " df[\"datetime\"].astype(str)\n", " + \"_I\"\n", " + df[\"idelem\"].astype(str)\n", " + \"_S\"\n", " + df['subarea'].astype(str)\n", ")\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PK/Unique identifier check\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "✅ PK is unique\n" ] } ], "source": [ "print(\"PK/Unique identifier check\")\n", "if df[\"PK\"].nunique() == df.shape[0]:\n", " print(\"✅ PK is unique\")\n", " # As we passed the PK quality check, we can set this PK as dataframe index\n", " df.set_index(\"PK\", inplace=True)\n", "else:\n", " print(\"❌ PK is not unique\")\n", " display(df[df[\"PK\"].duplicated()][[\"idelem\", \"datetime\"]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Por tanto para esta `PK` queda perfectamente idetificado el tráfico en un identificador concreto a una fecha concreta." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exactitud y Completitud\n", "En primer lugar comprobaremos que para cada hora tenemos el mismo número de datos:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A continuación comprobaremos los valores nulos para cada variable" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "idelem: 0/853114 valores nulos\n", "descripcion: 51731/853114 valores nulos\n", "accesoAsociado: 138546/853114 valores nulos\n", "intensidad: 0/853114 valores nulos\n", "ocupacion: 0/853114 valores nulos\n", "carga: 0/853114 valores nulos\n", "nivelServicio: 0/853114 valores nulos\n", "intensidadSat: 51731/853114 valores nulos\n", "error: 1683/853114 valores nulos\n", "subarea: 0/853114 valores nulos\n", "st_x: 0/853114 valores nulos\n", "st_y: 0/853114 valores nulos\n", "velocidad: 801383/853114 valores nulos\n", "datetime: 0/853114 valores nulos\n", "date: 0/853114 valores nulos\n" ] } ], "source": [ "for col in df.columns:\n", " print(f\"{col}: {df[col].isnull().sum()}/{df.shape[0]} valores nulos\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "La variable `velocidad` posee un 93% de valores nulos, por lo cual no nos va a aportar apenas información. Por tanto, decidimos eliminarla." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "df.drop(columns=\"velocidad\", inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Veamos si los valores nulos de `descripcion` coinciden con los valores nulos de `subarea`." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "descr_null_pk = df[df['descripcion'].isnull()].index\n", "subarea_null_pk = df[df['subarea']=='nan'].index" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Los valores nulos coinciden en ambas columnas\n" ] } ], "source": [ "if (descr_null_pk == subarea_null_pk).all():\n", " print('Los valores nulos coinciden en ambas columnas')\n", " \n", "else:\n", " print('Hay valores incompletos')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Por tanto, eliminaremos estos valores " ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "df=df[df['descripcion'].notnull()]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "idelem: 0/801383 valores nulos\n", "descripcion: 0/801383 valores nulos\n", "accesoAsociado: 86815/801383 valores nulos\n", "intensidad: 0/801383 valores nulos\n", "ocupacion: 0/801383 valores nulos\n", "carga: 0/801383 valores nulos\n", "nivelServicio: 0/801383 valores nulos\n", "intensidadSat: 0/801383 valores nulos\n", "error: 0/801383 valores nulos\n", "subarea: 0/801383 valores nulos\n", "st_x: 0/801383 valores nulos\n", "st_y: 0/801383 valores nulos\n", "datetime: 0/801383 valores nulos\n", "date: 0/801383 valores nulos\n" ] } ], "source": [ "for col in df.columns:\n", " print(f\"{col}: {df[col].isnull().sum()}/{df.shape[0]} valores nulos\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "De esta forma no tenemos ningún valor nulo" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## PROFILING 📑" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "2e0722a6c9da4a18bf557526a190d6d3", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Summarize dataset: 0%| | 0/5 [00:00_\",\n", " \"date\": \"Fecha de la petición a la API\",\n", " \"datetime\": \"Fecha y hora de la petición a la API\",\n", " \"idelem\": \"Identificador del punto de medida. Permite su posicionamiento sobre plano e identificación del vial y sentido de la circulación\",\n", " \"descripcion\": \"Denominación del punto de medida\",\n", " \"accesoAsociado\": \"Código de control relacionado con el control semafórico para la modificación de los tiempos\",\n", " \"intensidad\": \"Intensidad de número de vehículos por hora. Un valor negativo implica la ausencia de datos\",\n", " \"ocupacion\": \"Porcentaje de tiempo que está un detector de tráfico ocupado por un vehículo\",\n", " \"carga\": \"Parámetro de carga del vial. Representa una estimación del grado\",\n", " \"nivelServicio\": \"Parámetro calculado en función de la velocidad y la ocupación\",\n", " \"intensidadSat\": \"Intensidad de saturación de la vía en veh/hora\",\n", " \"error\": \"Código de control de la validez de los datos del punto de medida\",\n", " \"subarea\": \"Identificador de la subárea de explotación de tráfico a la que pertenece el punto de medida\",\n", " \"st_x\": \"Coordenada X UTM del centroide que representa al punto de medida en el fichero georreferenciado\",\n", " \"st_y\": \"Coordenada Y UTM del centroide que representa al punto de medida en el fichero georreferenciado\",\n", " \"velocidad\": \"Velocidad medida\",\n", " }\n", " },\n", " interactions=None,\n", " explorative=True,\n", " dark_mode=True,\n", ")\n", "profile.to_file(os.path.join(ROOT_PATH, \"docs\", \"qa\", \"informo_report.html\"))\n", "# profile.to_notebook_iframe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3.11.7 ('venv': venv)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "7056b316d281e41af8bb677c2c15c6d2127a9166071d02d66c8bc69014a8260f" } } }, "nbformat": 4, "nbformat_minor": 2 }